Abstract

We develop a framework for proving lower bounds on computational problems over distributions,
including optimization and unsupervised learning. Our framework is based on defining
a restricted class of algorithms, called statistical algorithms, that instead of accessing samples
from the input distribution can only obtain an estimate of the expectation of any given function
on a sample drawn randomly from the input distribution. Our definition captures many natural
algorithms used in theory and practice, e.g. moments-based methods, local search, MCMC and
simulated annealing. Our techniques are inspired by (and generalize) the statistical query model
in learning theory, which captures the complexity of PAC learning using essentially all known
learning methods [Kearns, 1998].



For specific well-known problems over distributions, we give lower bounds on the complexity
of any statistical algorithm. These include an exponential lower bound for moment maximization
in $\mathbb{R}^n$, and a nearly optimal lower bound for detecting planted clique distributions when
the planted clique has size $O(n^{1/2-\delta})$ for any constant $\delta > 0$. Variants of the latter
problem have been assumed to be hard in order to prove hardness for other problems and for
cryptographic applications. Our lower bounds provide concrete evidence supporting these assumptions.

1 Introduction



Our primary motivation is to establish computational lower bounds on a set of well-known search 
and optimization problems defined over distributions that can be sampled. The traditional approach 
to this is based on reductions to problems conjectured to be intractable. Here we present a new 
approach: we show that a broad class of algorithms, which we refer to as statistical algorithms, 
must have high asymptotic complexity, unconditionally. 



Our definition encompasses many well-known algorithms, such as EM [Dempster et al., 1977],
local search, MCMC optimization [Tanner and Wong, 1987, Gelfand and Smith, 1990], simulated
annealing [Kirkpatrick et al., 1983, Černý, 1985], as well as first and second order methods for
linear/convex optimization, e.g. [Dunagan and Vempala, 2008]. We define this class of algorithms
and show they must have high complexity for problems such as detecting large planted cliques or 
planted dense subgraphs, maximizing a polynomial over the unit sphere, maximum satisfiability, 
etc. These results rule out many natural approaches to solving these problems in theory and 
provide some practical guidance about when not to use popular and generic heuristics such as 
EM or simulated annealing. Our work also serves to highlight the question: what nonstatistical 
algorithms exist for search and optimization problems? 

The inspiration for our model comes from the statistical query (SQ) model in learning theory
[Kearns, 1998], where any algorithm that is based only on statistical queries must have complexity
that grows with the statistical query dimension of the hypothesis class being learned [Blum et al.,
1994]. In particular, this rules out polynomial-time SQ algorithms for learning parities from the
uniform distribution on $\{-1, 1\}^n$. Our definition generalizes SQ algorithms which are known to
capture almost all efficient algorithms for learning. Before we define our model precisely, we mention 
two specific motivating problems. 

Detecting Planted Cliques. In the standard planted clique problem, we are given a graph
$G$ whose edges are generated by starting with a random graph $G_{n,1/2}$, then "planting" (adding
edges to make) a clique on $k$ vertices. Jerrum [1992] introduced the planted clique problem as a
potentially easier variant of the classical problem of finding the largest clique in a random graph.
A random graph $G_{n,1/2}$ contains a clique of size $2\log n$ with high probability, and a simple
greedy algorithm can find one of size $\log n$, but it appears hard to find one of size
$(1+\epsilon)\log n$ for any $\epsilon > 0$. Planting a larger clique should make it easier to find one.
The smallest $k$ for which such a clique can be detected in polynomial time is $\Omega(\sqrt{n})$
[Alon et al., 1998, McSherry, 2001], using an eigenvector-based algorithm. For
$k > c\sqrt{n \log n}$, simply picking vertices of large degrees suffices [Kučera, 1995]. One
intriguing aspect of this problem is that for any $k$, there is a quasipolynomial algorithm: guess
$2\log n$ vertices from the clique and take all their common neighbors.

Some evidence toward the hardness of the problem was shown by Jerrum [1992], who proved
that a specific approach using a Markov chain cannot be efficient for small $k$. The problem has
been used to generate cryptographic primitives [Juels and Peinado, 2000], as well as to demonstrate
the hardness of finding approximate Nash equilibria of certain games [Hazan and Krauthgamer,
2011, Minder and Vilenchik, 2009]. Bipartite versions of the planted clique problem have also been
extensively studied. Here a bipartite clique is planted in a random bipartite graph. A version of the
bipartite planted clique problem has been used as a hard problem for cryptographic applications
[Applebaum et al., 2010]. We now define the planted bipartite clique problem formally.



Problem 1 (planted bipartite k-clique). For $1 \le k \le n$, let $S \subseteq \{1, 2, \ldots, n\}$ be a
set of $k$ vertex indices and $D_S$ be a distribution over $\{0,1\}^n$ such that when $x \sim D_S$,
with probability $1 - (k/n)$ the entries of $x$ are chosen uniformly and independently from
$\{0,1\}$, and with probability $k/n$ the $k$ coordinates in $S$ are set to 1 and the rest are chosen
uniformly and independently from $\{0,1\}$. The planted bipartite k-clique problem is to find the
unknown subset $S$ given access to samples from $D_S$.
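To make the distribution in Problem 1 concrete, the following is a minimal sampling sketch. The function names `sample_planted` and `coordinate_means` are hypothetical, and coordinates are 0-indexed here rather than 1-indexed as in the problem statement.

```python
import random


def sample_planted(n, S, rng):
    """Draw one x in {0,1}^n from D_S (coordinates are 0-indexed here)."""
    k = len(S)
    # Every coordinate starts as an independent uniform bit.
    x = [rng.randint(0, 1) for _ in range(n)]
    # With probability k/n the coordinates in S are overwritten with 1.
    if rng.random() < k / n:
        for i in S:
            x[i] = 1
    return x


def coordinate_means(n, S, num_samples, seed=0):
    """Empirical frequency of a 1 in each coordinate over num_samples draws."""
    rng = random.Random(seed)
    totals = [0] * n
    for _ in range(num_samples):
        for i, bit in enumerate(sample_planted(n, S, rng)):
            totals[i] += bit
    return [t / num_samples for t in totals]
```

With enough samples, the empirical means of coordinates in $S$ should sit visibly above 1/2 while the remaining coordinates stay near 1/2.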

One can view the vectors x as adjacency vectors of a random bipartite graph with n vertices on 
one side and a planted bipartite clique with an expected k/n fraction of vertices on either side. This 
formulation captures the traditional bipartite planted clique problem when exactly n examples are 
drawn from $D_S$. In addition to planted clique, our lower bounds will also apply to planted dense
subgraphs; here the probability of a coordinate in $S$ being 1 is $q > 1/2$. Known algorithms for
these problems require cliques (or dense subgraphs) of size $k = \Omega(\sqrt{n})$. Our main result for this
problem is a nearly matching lower bound for any statistical algorithm. 

Moment Maximization. Our second example is an optimization problem defined as follows. 

Problem 2 (moment maximization). Let $D$ be a distribution over $[-1, 1]^n$ and let
$r \in \mathbb{Z}^+$. The moment maximization problem is to find a unit vector $u^*$ that maximizes
the expected $r$'th moment of the projection of $D$ to $u^*$, i.e.,

$$u^* = \arg\max_{u \in \mathbb{R}^n : \|u\| = 1} \mathbb{E}_{x \sim D}\left[(u \cdot x)^r\right].$$

The complexity of finding approximate optima is interesting as well. For $r = 2$, an optimal
vector simply corresponds to the principal component of the distribution $D$ and can be found
by the singular value decomposition. For higher $r$, there are no efficient algorithms known, and
the problem is NP-hard for some distributions [Brubaker, 2009, Hillar and Lim, 2009]. It can
be viewed as finding the 2-norm of an $r$'th order tensor (the moment tensor of $D$). For $r = 3$,
Frieze and Kannan [2008] give a reduction from finding a planted clique in a random graph to
this tensor norm maximization problem; this was extended to general $r$ in [Brubaker and Vempala,
2009]. Specifically, they show that maximizing the $r$'th moment (or the 2-norm of an $r$'th order
tensor) allows one to recover planted cliques of size $\tilde{O}(n^{1/r})$.

For moment maximization over a distribution that can be sampled, it is natural to consider the
following type of optimization algorithm: start with some unit vector $u$, then estimate the gradient
at $u$ (via samples), and move along that direction staying on the sphere; repeat to reach a local
maximum. Unfortunately, over the unit sphere, the expected $r$'th moment function can have
(exponentially) many local maxima even for simple distributions. A more sophisticated approach
[Kannan] for both problems is through Markov chains or simulated annealing; it attempts to sample
unit vectors from a distribution on the sphere which is heavier on vectors that induce a higher
moment, e.g., $u$ is sampled with density proportional to $e^{f(u)}$ where $f(u)$ is the expected
$r$'th moment along $u$. This could be implemented by a Markov chain with a Metropolis filter
[Metropolis et al., 1953, Hastings, 1970] ensuring a proportional steady state distribution. If the
Markov chain were to mix rapidly, that would give an efficient approximation algorithm because
sampling from the steady state likely gives a vector of high moment. At each step, all one needs is
to be able to estimate $f(u)$, which can be done by sampling from the input distribution.
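The gradient-based local search described above can be sketched as follows. This is an illustrative sketch, not the authors' procedure: the names `estimate_gradient` and `ascend_on_sphere`, the fixed step size, and the batch size are all assumptions. Each gradient coordinate is an empirical average of a function of a sample, which is the only way the code touches the distribution.

```python
import math
import random


def normalize(u):
    """Rescale u to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in u))
    return [c / norm for c in u]


def estimate_gradient(samples, u, r):
    """Empirical gradient of E[(u.x)^r], i.e. r * mean((u.x)^(r-1) * x)."""
    n = len(u)
    grad = [0.0] * n
    for x in samples:
        weight = r * sum(ui * xi for ui, xi in zip(u, x)) ** (r - 1)
        for i in range(n):
            grad[i] += weight * x[i]
    return [g / len(samples) for g in grad]


def ascend_on_sphere(sample_fn, n, r, steps=100, step_size=0.2, batch=500, seed=0):
    """Repeat: estimate the gradient from fresh samples, step, project back to the sphere."""
    rng = random.Random(seed)
    u = normalize([rng.gauss(0.0, 1.0) for _ in range(n)])
    for _ in range(steps):
        samples = [sample_fn(rng) for _ in range(batch)]
        grad = estimate_gradient(samples, u, r)
        u = normalize([ui + step_size * gi for ui, gi in zip(u, grad)])
    return u
```

For $r = 2$ this behaves like a noisy power iteration and tends toward the principal component; for higher $r$ it may stop at any of the local maxima mentioned above.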

As we will see presently, these approaches fall under a class of algorithms we call statistical 
algorithms, and they will all have provably high complexity and nearly matching upper bounds. 

2 Definitions and overview



We now describe our model, our approach for proving lower bounds, and some applications in detail.

2.1 Model



The statistical query learning model of Kearns [1998] is a restriction of the PAC model [Valiant,
1984]. It captures algorithms that rely on empirical estimates of statistical properties of random
examples of an unknown function instead of individual random examples (as in the PAC model
of learning). Here a statistical property refers to the expectation of any boolean function of an
example with respect to the unknown distribution of examples.

In the same spirit, for general search, decision and optimization problems over a distribution, 
we define statistical algorithms as algorithms that do not see samples from the distribution but 
instead have access to estimates of the expectation of any bounded function of a sample from the 
distribution. 

Definition 1 (statistical algorithms). Let $D$ be the input distribution over the domain $X$. We say
that an algorithm is statistical if it does not have direct access to samples from $D$, but instead
makes calls to an oracle $\mathrm{STAT}_D$, which takes as input any function
$h \in \mathcal{H} : X \to [-1, 1]$ and a tolerance parameter $\tau > 0$. $\mathrm{STAT}_D(h, \tau)$
returns a value

$$v \in [h(D) - \tau,\ h(D) + \tau],$$

where $h(D)$ denotes the expectation $\mathbb{E}_{x \sim D}[h(x)]$.

The most natural realization of a $\mathrm{STAT}_D$ oracle is one that computes $h$ on $O(1/\tau^2)$
random samples from $D$ and returns their average. In fact, as we will show later, $1/\tau^2$ roughly
corresponds to the sample complexity of a (usual) algorithm whereas the number of queries roughly
corresponds to the running time complexity.
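As a rough sanity check on the $O(1/\tau^2)$ sample count, Hoeffding's inequality for i.i.d. samples $x_1, \ldots, x_m \sim D$ and a query $h$ with values in $[-1, 1]$ gives the bound below. The failure probability $\delta$ is a parameter introduced here for illustration; it is not part of the definition above.

```latex
\Pr\left[\,\left|\frac{1}{m}\sum_{i=1}^{m} h(x_i) - \mathbb{E}_{x \sim D}[h(x)]\right| > \tau\right]
  \le 2\exp\left(-\frac{m\tau^2}{2}\right),
\qquad\text{so}\qquad
m \ge \frac{2\ln(2/\delta)}{\tau^2}
\;\Longrightarrow\;
\text{the average is a valid reply with probability at least } 1-\delta .
```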

The general algorithmic techniques mentioned earlier can all be expressed in this model in a 
relatively straightforward way. We would also like to note that in the PAC learning model some 
of the algorithms, such as the Perceptron algorithm, did not initially appear to fall in the SQ 
framework but SQ analogues were later found for all known learning techniques except Gaussian
elimination (for examples see Kearns [1998] and Blum et al. [1997]). We expect the situation to
be similar even in the broader context of search problems over distributions.

The STAT oracle we defined can return any value within the given tolerance and therefore 
can make adversarial choices. We also aim to prove lower bounds against algorithms that use a 
potentially more benign, "honest" statistical oracle. The honest statistical oracle gives the algorithm
the true value of a boolean query function on a randomly chosen sample. This model makes the
sample complexity explicit and is based on the Honest SQ model in learning by Yang [2001] (which
itself is based on an earlier model of Jackson [2003]).



Definition 2 (honest statistical algorithms). Let $D$ be the input distribution over the domain
$X$. An honest statistical algorithm does not have direct access to samples from $D$, but instead
makes calls to an oracle $\mathrm{HSTAT}_D$, which takes as input any function
$h \in \mathcal{H} : X \to \{-1, 1\}$. $\mathrm{HSTAT}_D(h)$ takes an independent random sample $x$
from $D$ and returns $h(x)$.

Note that the HSTAT oracle draws a fresh sample each time it is called. Without resampling
each time, an honest statistical algorithm could easily recover the sample bit-by-bit,
making it equivalent to the usual access to random samples. The sample complexity of an
honest statistical algorithm is defined to be the number of calls it makes to the HSTAT oracle.
Note that the HSTAT oracle can be used to simulate STAT (with high probability) by taking the
average of $O(1/\tau^2)$ replies of HSTAT for the same function $h$ (see footnote 1). While it might
seem that access to HSTAT gives an algorithm more power than access to STAT, we will show that
HSTAT can be simulated using STAT and also prove sample complexity lower bounds for honest
statistical algorithms directly.
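The simulation of STAT by averaging HSTAT replies can be sketched as follows. The class name `HonestOracle`, the function `simulate_stat`, and the failure probability `delta` are hypothetical; the sample count $m = \lceil 2\ln(2/\delta)/\tau^2 \rceil$ is taken from Hoeffding's inequality for $\{-1,1\}$-valued replies and is one concrete choice of the $O(1/\tau^2)$ in the text.

```python
import math
import random


class HonestOracle:
    """HSTAT sketch: each query is answered on one fresh sample from D."""

    def __init__(self, sample_fn, seed=0):
        self._sample_fn = sample_fn
        self._rng = random.Random(seed)
        self.calls = 0  # sample complexity = number of calls made so far

    def query(self, h):
        self.calls += 1
        value = h(self._sample_fn(self._rng))
        if value not in (-1, 1):
            raise ValueError("honest queries must be {-1, 1}-valued")
        return value


def simulate_stat(oracle, h, tau, delta=0.01):
    """Average enough HSTAT replies to be within tau with probability >= 1 - delta."""
    m = math.ceil(2.0 * math.log(2.0 / delta) / tau ** 2)
    return sum(oracle.query(h) for _ in range(m)) / m
```

The `calls` counter makes the cost of each simulated STAT query explicit: one query of tolerance $\tau$ consumes on the order of $1/\tau^2$ honest samples.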

We are now ready to formally define problems over distributions. 

Definition 3 (search problems over distributions). For a domain $X$, let $\mathcal{D}$ be a set of
distributions over $X$, let $\mathcal{F}$ be a set of solutions and $Z : \mathcal{D} \to 2^{\mathcal{F}}$
be a map from a distribution $D \in \mathcal{D}$ to a subset of solutions $Z(D) \subseteq \mathcal{F}$
that are defined to be valid solutions for $D$. The search problem $Z$ over $\mathcal{D}$ and
$\mathcal{F}$ is to find a valid solution $f \in Z(D)$ given access to random samples from any
$D \in \mathcal{D}$.

We note that this definition captures decision problems by having $\mathcal{F} = \{0, 1\}$. With slight
abuse of notation, for a solution $f \in \mathcal{F}$ we denote by $Z^{-1}(f)$ the set of distributions
in $\mathcal{D}$ for which $f$ is a valid solution.

For some of the optimization problems we consider, it is natural to let the solution space
$\mathcal{F}$ contain real-valued functions over $X$ and define the valid functions
$Z(D) = \{f \in \mathcal{F} \mid \mathbb{E}_{x \sim D}[f(x)] \ge \mathbb{E}_{x \sim D}[f^*(x)] - \epsilon\}$,
where $f^* = \arg\max_{f \in \mathcal{F}} \mathbb{E}_{x \sim D}[f(x)]$, i.e., the set of functions that are
within additive error $\epsilon$ of being optimal. We refer to finding such a valid function as
$\epsilon$-optimization.



2.2 Statistical Dimension of search problems 



The main tool in our analysis is an information-theoretic bound on the complexity of statistical 
algorithms based on the structure of a search problem over a distribution. Our definitions and 
techniques draw heavily upon the statistical query (SQ) model in learning theory, wherein the 
complexity of a large class of learning algorithms (most known learning algorithms) is characterized 
via a single parameter called the SQ dimension. Roughly speaking, it corresponds to the number of
nearly uncorrelated labeling functions in the class [Blum et al., 1994, Kearns, 1998]. We introduce
a natural generalization of this idea to search problems over arbitrary sets of distributions and 
prove a lower bound on the complexity of statistical algorithms based on the generalized notion. In 
addition, instead of relying on a bound on pairwise correlations, our dimension relies on a bound 
on average correlations in a large set of distributions. This weaker condition allows us to derive 
the tight bounds on the complexity of statistical algorithms for the planted k-clique problem.

We now define our dimension formally. For two functions $f, g : X \to \mathbb{R}$ and a distribution $D$
with probability density function $D(x)$, the inner product of $f$ and $g$ over $D$ is defined as

$$\langle f, g \rangle_D = \mathbb{E}_{x \sim D}[f(x) g(x)].$$

The norm of $f$ over $D$ is $\|f\|_D = \sqrt{\langle f, f \rangle_D}$. We remark that, by convention, the
integral from the inner product is taken only over the support of $D$, i.e. for $x \in X$ such that
$D(x) \neq 0$. We also note that for a distribution $D_i$, the quantity
$\left\langle \frac{D_i}{D} - 1, \frac{D_i}{D} - 1 \right\rangle_D$ is known as the $\chi^2(D_i, D)$
distance. For a set $\mathcal{D}'$ of $m$ distributions over $X$ and a reference distribution $D$ over
$X$ we define

$$\rho(\mathcal{D}', D) = \frac{1}{m^2} \sum_{D_1, D_2 \in \mathcal{D}'}
  \left| \left\langle \frac{D_1}{D} - 1, \frac{D_2}{D} - 1 \right\rangle_D \right|.$$
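For distributions over a finite domain, the average correlation $\rho$ can be computed directly from its definition. The sketch below assumes each distribution is given as a dictionary from points to probabilities; the function names are hypothetical.

```python
def relative_density(Di, D):
    """The function D_i(x)/D(x) - 1, restricted to the support of D."""
    return {x: Di.get(x, 0.0) / px - 1.0 for x, px in D.items() if px > 0}


def inner(f, g, D):
    """<f, g>_D = E_{x~D}[f(x) g(x)], summing over the support of D only."""
    return sum(px * f[x] * g[x] for x, px in D.items() if px > 0)


def average_correlation(dists, D):
    """rho(D', D): mean of |<D1/D - 1, D2/D - 1>_D| over all ordered pairs."""
    hats = [relative_density(Di, D) for Di in dists]
    m = len(hats)
    return sum(abs(inner(a, b, D)) for a in hats for b in hats) / m ** 2
```

As an example, take $D$ uniform on $\{-1,1\}^2$ and $D_c(x) = (1 + \chi_c(x))/4$ for the three non-constant parities $\chi_c$. Then $D_c/D - 1 = \chi_c$, the parities are orthonormal under $D$, and only the three diagonal pairs contribute, so $\rho = 3/9 = 1/3$.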



1 Unlike HSTAT, STAT allows non-boolean functions that can be handled by first converting a real-valued query
h to several boolean queries.

We are now ready to define the concept of statistical dimension.



Definition 4. For $\gamma > 0$, domain $X$ and a search problem $Z$ over a set of solutions
$\mathcal{F}$ and a class of distributions $\mathcal{D}$ over $X$, let $d$ be the largest integer such
that there exists a reference distribution $D$ over $X$ such that for every $f \in \mathcal{F}$ there
exists a set of $m > 0$ distributions
$\mathcal{D}_f = \{D_1, \ldots, D_m\} \subseteq \mathcal{D} \setminus Z^{-1}(f)$ satisfying the following
property: for any subset $\mathcal{D}' \subseteq \mathcal{D}_f$ where $|\mathcal{D}'| > m/d$,
$\rho(\mathcal{D}', D) < \gamma$. We define the statistical dimension with average correlation
$\gamma$ of $Z$ to be $d$ and denote it by $\mathrm{SDA}(Z, \gamma)$.

The statistical dimension with average correlation $\gamma$ of a search problem gives a lower bound
on the complexity of any statistical algorithm for the problem that uses queries of tolerance
$\sqrt{\gamma}$.

Theorem 1. Let $X$ be a domain and $Z$ be a search problem over a set of solutions $\mathcal{F}$ and a
class of distributions $\mathcal{D}$ over $X$. For $\gamma > 0$ let $d = \mathrm{SDA}(Z, \gamma)$. Any
statistical algorithm requires at least $d$ calls of tolerance $\tau = \sqrt{\gamma}$ to the STAT oracle
to solve $Z$.

It also gives a lower bound on the sample complexity of any honest statistical algorithm. 

Theorem 2. Let $X$ be a domain and $Z$ be a search problem over a class of solutions $\mathcal{F}$ and a
class of distributions $\mathcal{D}$ over $X$. For $\gamma > 0$ let $d = \mathrm{SDA}(Z, \gamma)$. Any
honest statistical algorithm that solves $Z$ with probability greater than 13/14 requires at least



samples from HSTAT oracle. 

The bound on the average correlation of large subsets upon which our notion is based can be 
easily obtained from a bound on pairwise correlations. Pairwise correlations are easier to analyze 
and therefore we now define a special case of our statistical dimension based on pairwise
correlations. This version can also be easily related to the statistical query dimension from learning
theory (see Section 6). Hence, we define a second notion of statistical dimension, which is easier to
work with in some cases. 

Definition 5 (statistical dimension). For $\gamma, \beta > 0$, domain $X$ and a search problem $Z$ over
a set of solutions $\mathcal{F}$ and a class of distributions $\mathcal{D}$ over $X$, let $m$ be the
maximum integer such that there exists a reference distribution $D$ over $X$ such that for every
$f \in \mathcal{F}$ there exists a set of $m$ distributions
$\mathcal{D}_f = \{D_1, \ldots, D_m\} \subseteq \mathcal{D} \setminus Z^{-1}(f)$ satisfying the following
property:

$$\left| \left\langle \frac{D_i}{D} - 1, \frac{D_j}{D} - 1 \right\rangle_D \right| \le
\begin{cases}
\beta & \text{for } i = j \in [m] \\
\gamma & \text{for } i \neq j \in [m].
\end{cases}$$

We define the statistical dimension with pairwise correlations $(\gamma, \beta)$ of $Z$ to be $m$ and
denote it by $\mathrm{SD}(Z, \gamma, \beta)$.

A corresponding lower bound can be obtained as a corollary of Theorem 1.

Corollary 1. Let $X$ be a domain and $Z$ be a search problem over a set of solutions $\mathcal{F}$ and a
class of distributions $\mathcal{D}$ over $X$. For $\gamma, \beta > 0$ let
$m = \mathrm{SD}(Z, \gamma, \beta)$. Any statistical algorithm requires at least
$m(\tau^2 - \gamma)/(\beta - \gamma)$ calls of tolerance $\tau > 0$ to the STAT oracle to solve $Z$.

As we show in Section 3, this corollary follows by an appropriate choice of parameters.
Furthermore, we can obtain a similar corollary for honest statistical algorithms (see the
corresponding corollary in Section 3.2).

To conclude this section, we mention that in related work in the context of convex optimization,
Raginsky and Rakhlin [2011] consider sequential optimization from noisy information and
prove information-theoretic lower bounds.



2.3 Lower bounds 

Our main lower bound is for the bipartite planted clique problem, for which we are able to show 
the following lower bound. 

Theorem 3. For any constant $\delta > 0$ and any $k \le n^{1/2-\delta}$, at least
$n^{\Omega(\log\log n)}$ queries of tolerance $\tau = \tilde{\Omega}(k/n)$ are required to find a planted
bipartite clique of size $k$ by any statistical algorithm.

We note that this bound is close to tight. For every vertex in the clique, the probability that
the corresponding bit of a randomly chosen point is set to 1 is $1/2 + k/(2n)$ whereas for every
vertex not in the clique this probability is 1/2. Therefore using $n$ queries of tolerance $k/(4n)$ it
is easy to detect the planted clique.
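The bias of clique coordinates follows directly from the definition of $D_S$ in Problem 1. The short calculation below spells it out, with $x_i$ denoting the $i$-th coordinate of a sample.

```latex
\Pr_{x \sim D_S}[x_i = 1] =
\begin{cases}
\left(1 - \frac{k}{n}\right)\cdot\frac{1}{2} + \frac{k}{n}\cdot 1 = \frac{1}{2} + \frac{k}{2n}
  & \text{if } i \in S,\\[4pt]
\left(1 - \frac{k}{n}\right)\cdot\frac{1}{2} + \frac{k}{n}\cdot\frac{1}{2} = \frac{1}{2}
  & \text{if } i \notin S.
\end{cases}
```

A reply of tolerance $k/(4n)$ to the query $h_i(x) = x_i$ therefore lies in $[1/2 + k/(4n),\ 1/2 + 3k/(4n)]$ when $i \in S$ and in $[1/2 - k/(4n),\ 1/2 + k/(4n)]$ otherwise. The two ranges meet only at the shared endpoint, so any slightly smaller tolerance separates the cases strictly.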

We also give a sample complexity lower bound. To place this bound in context, we note that 
it is easy to detect whether a clique of size $k$ has been planted using $O(n^2/k^2)$ samples: compute
the average of $\sum_{i=1}^{n} x_i$; this will be noticeably higher if a clique has been planted.
Moreover the clique subset itself can be found with this number of samples via the eigenvector
approach. The next theorem is a lower bound that applies to any statistical algorithm. In particular,
it implies that for cliques of size smaller than $\sqrt{n}$, one needs more than $n$ samples for
statistical algorithms to work.

Theorem 4. For any constant $\delta > 0$ and any $k \le n^{1/2-\delta}$, $\Omega(n^2/k^2)$ samples are
required by any honest statistical algorithm to find a planted clique of size $k$.

A closely related problem is the planted densest subgraph problem, where edges in the planted
subset appear with higher probability than in the remaining graph. This is a variant of the densest
k-subgraph problem, which itself is a natural generalization of k-clique that asks to recover the
densest $k$-vertex subgraph of a given $n$-vertex graph [Feige, 2002, Khot, 2004, Bhaskara et al.,
2010, 2012]. The conjectured hardness of its average case variant, the planted densest subgraph
problem, has been used in public key encryption schemes [Applebaum et al., 2010] and in analyzing
parameters specific to financial markets [Arora et al., 2010]. Our lower bounds extend in a
straightforward manner to this problem.

We next turn to other applications of statistical dimension to some natural optimization problems
over distributions. In particular, we show that any statistical algorithm for the moment
maximization problem defined above, as well as distributional variants of MAX-XOR-SAT and
k-CLIQUE, must have high complexity.

Theorem 5. For the $r$'th moment maximization problem let $\mathcal{F}$ be the set of functions
indexed by all possible unit vectors $u \in \mathbb{R}^n$, defined over the domain $\{-1, 1\}^n$ with
$f_u(x) = (u \cdot x)^r$. Let $\mathcal{D}$ be the set of all distributions over $\{-1, 1\}^n$. Then for
$r$ odd and $\delta > 0$, at least $\tau^2\left(\binom{n}{r} - 1\right)$ queries of tolerance $\tau$ are
required to $\left(\frac{r!}{2(r+1)^{r/2}} - \delta\right)$-optimize over $\mathcal{F}$ and
$\mathcal{D}$ for any statistical algorithm.

In words, any statistical algorithm that maximizes the $r$'th moment (for odd $r$) to within roughly
$(r/e)^{r/2}$ must have complexity that grows as $\binom{n}{r}$.

The MAX-XOR-SAT problem over a distribution is defined as follows. 

Problem 3 (MAX-XOR-SAT). Let D be a distribution over XOR clauses of arbitrary length, in n 
variables. The MAX-XOR-SAT problem is to find an assignment x that maximizes the number 
of satisfied clauses under the given distribution. 

In the worst case, it is known that MAX-XOR-SAT is NP-hard to approximate to within
$1/2 - \epsilon$ for any constant $\epsilon$ [Håstad, 2001]. In practice, local search algorithms such
as WalkSat [Selman et al., 1995] are commonly applied as heuristics for maximum satisfiability
problems. We show that the distribution version of MAX-XOR-SAT is unconditionally hard for
algorithms that locally seek to improve an assignment by flipping variables so as to satisfy more
clauses, giving some theoretical justification for the observations of Selman et al. [1995].
Moreover, our proof even applies to the case when there exists an assignment that satisfies all the
clauses generated by the target distribution.

Theorem 6. For the MAX-XOR-SAT problem, let $\mathcal{F}$ be the set of functions indexed by all
possible assignments in $n$ variables and whose domain is the set of all clauses. (The value that such
a function takes when evaluated on a clause is the truth value of the clause under the given
assignment.) Let $\mathcal{D}$ be the set of all distributions over clauses. Then for $\delta > 0$, at
least $\tau^2(2^n - 1)$ queries of tolerance $\tau$ are required to
$\left(\frac{1}{2} - \delta\right)$-optimize over $\mathcal{F}$ and $\mathcal{D}$ for any statistical
algorithm.

Next, we consider the distribution version of the k-clique problem.

Problem 4 (distributional k-clique). Let $D$ be a distribution over graphs $G$. The k-clique problem
is to find a subset $S$ of size $k$ that maximizes the probability that $S$ is a clique in $G$.



Detecting whether a graph has a clique of size $k$ is NP-hard [Karp, 1972], fixed-parameter
intractable (hard for W[1] [Downey and Fellows, 1999]) and no algorithm faster than
$O(n^{.792k})$ is known [Nešetřil and Poljak, 1985], even for a large constant $k$. While our lower
bound does not give insight into the computational hardness of k-clique on worst-case inputs, it says
that the k-clique problem over a distribution on graphs has high complexity for any statistical
algorithm.

Theorem 7. For the distributional k-clique problem, let $\mathcal{F}$ be the indicator functions
indexed by subsets $S$ of $k$ vertices and whose domain is the set of all graphs on $n$ vertices, that
indicate whether $S$ is a k-clique in the input graph. Let $\mathcal{D}$ be the set of distributions
over graphs on $n$ vertices. Then for $\delta > 0$, at least $\tau^2\left(\binom{n}{k} - 1\right)$
queries of tolerance $\tau$ are required to $\left(2^{-\binom{k}{2}} - \delta\right)$-optimize over
$\mathcal{F}$ and $\mathcal{D}$ for any statistical algorithm.

3 Lower bounds from Statistical Dimension 

Here we prove the general lower bounds. In later sections, we will compute the parameters in these 
bounds for specific problems of interest. 

3.1 Lower bounds for Statistical Algorithms

We begin with the proof of Theorem 1.

Proof of Theorem 1. Let $A$ be a statistical algorithm that uses $q$ queries of tolerance
$\tau = \sqrt{\gamma}$ to solve $Z$ over a class of solutions $\mathcal{F}$ and a class of distributions
$\mathcal{D}$, such that $\mathrm{SDA}(Z, \gamma) = d$. Let $D$ be the reference distribution for which
the value $d$ is achieved. We simulate $A$ by answering any query $h : X \to [-1, 1]$ of $A$ with value
$\mathbb{E}_D[h(x)]$. Let $h_1, h_2, \ldots, h_q$ be the queries asked by $A$ in this simulation and let
$f$ be the output of $A$.

By definition of SDA, there exists a set of $m$ distributions $\mathcal{D}_f = \{D_1, \ldots, D_m\}$ for
which $f$ is not a valid solution and such that for every $\mathcal{D}' \subseteq \mathcal{D}_f$, either
$\rho(\mathcal{D}', D) < \gamma$ or $|\mathcal{D}'| \le m/d$.

In the rest of the proof for conciseness we drop the subscript D from inner products and norms. 

To lower bound $q$, we use a generalization of an elegant argument of Szörényi [2009]. For every
$k \le q$ let $A_k$ be the set of distributions $D_i$ such that
$|\mathbb{E}_D[h_k(x)] - \mathbb{E}_{D_i}[h_k(x)]| > \tau$. To prove the desired bound we first prove
the following two claims:

1. $\sum_{k \le q} |A_k| \ge m$;

2. for every $k$, $|A_k| \le m/d$.

Combining these two immediately implies the desired bound $q \ge d$.

To prove the first claim we assume, for the sake of contradiction, that there exists
$D_i \notin \bigcup_{k \le q} A_k$. Then for every $k \le q$,
$|\mathbb{E}_D[h_k(x)] - \mathbb{E}_{D_i}[h_k(x)]| \le \tau$. This implies that the replies of our
simulation $\mathbb{E}_D[h_k(x)]$ are within $\tau$ of $\mathbb{E}_{D_i}[h_k(x)]$. By the definition of
$A$, this implies that $f$ is a valid solution for $Z$ on $D_i$, contradicting the condition that
$D_i \in \mathcal{D} \setminus Z^{-1}(f)$.

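To prove the second claim, suppose that $|A_k| > m/d$. Note that

$$\mathbb{E}_{D_i}[h_k(x)] - \mathbb{E}_D[h_k(x)]
  = \mathbb{E}_D\left[\frac{D_i(x)}{D(x)} h_k(x)\right] - \mathbb{E}_D[h_k(x)]
  = \left\langle h_k, \frac{D_i}{D} - 1 \right\rangle.$$

Let $\hat{D}_i(x) = \frac{D_i(x)}{D(x)} - 1$ (where the convention is that $\hat{D}_i(x) = 0$ if
$D(x) = 0$). We will next show upper and lower bounds on the following quantity

$$\left\langle h_k, \sum_{i \in A_k} \hat{D}_i \cdot \mathrm{sign}\langle h_k, \hat{D}_i \rangle \right\rangle^2.$$

By Cauchy-Schwarz we have that

$$\left\langle h_k, \sum_{i \in A_k} \hat{D}_i \cdot \mathrm{sign}\langle h_k, \hat{D}_i \rangle \right\rangle^2
  \le \|h_k\|^2 \cdot \left\| \sum_{i \in A_k} \hat{D}_i \cdot \mathrm{sign}\langle h_k, \hat{D}_i \rangle \right\|^2
  \le \|h_k\|^2 \cdot \sum_{i, j \in A_k} \left| \langle \hat{D}_i, \hat{D}_j \rangle \right|
  = \|h_k\|^2 \cdot \rho(A_k, D) \cdot |A_k|^2. \qquad (1)$$

We also have that

$$\left\langle h_k, \sum_{i \in A_k} \hat{D}_i \cdot \mathrm{sign}\langle h_k, \hat{D}_i \rangle \right\rangle^2
  = \left( \sum_{i \in A_k} \langle h_k, \hat{D}_i \rangle \cdot \mathrm{sign}\langle h_k, \hat{D}_i \rangle \right)^2
  = \left( \sum_{i \in A_k} \left| \langle h_k, \hat{D}_i \rangle \right| \right)^2
  \ge \tau^2 |A_k|^2 = \gamma |A_k|^2. \qquad (2)$$

By combining these two inequalities we obtain that $\|h_k\|^2 \cdot \rho(A_k, D) \ge \tau^2$, which for
$\|h_k\|^2 \le 1$ implies that $\rho(A_k, D) \ge \gamma$, which contradicts the definition of SDA. □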

We now give the simple proof of the pairwise correlation version of the statistical dimension-based
lower bound (Corollary 1).

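Proof of Corollary 1. Take $d = m(\tau^2 - \gamma)/(\beta - \gamma)$; we will prove that
$\mathrm{SDA}(Z, \tau^2) \ge d$ and apply Theorem 1. Consider a set of distributions
$\mathcal{D}' \subseteq \mathcal{D}_f$ where $|\mathcal{D}'| \ge m/d = (\beta - \gamma)/(\tau^2 - \gamma)$:

$$\rho(\mathcal{D}', D)
  = \frac{1}{|\mathcal{D}'|^2} \sum_{D_1, D_2 \in \mathcal{D}'}
    \left| \left\langle \frac{D_1}{D} - 1, \frac{D_2}{D} - 1 \right\rangle_D \right|
  \le \frac{1}{|\mathcal{D}'|^2} \left( |\mathcal{D}'| \beta + (|\mathcal{D}'|^2 - |\mathcal{D}'|) \gamma \right)
  \le \gamma + \frac{\beta - \gamma}{|\mathcal{D}'|}
  \le \tau^2.$$

□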

The same argument for bounding the average correlation also gives a direct bound on
SDA in terms of a bound on SD.

Corollary 2. Let $X$ be a domain and $Z$ be a search problem over a set of solutions $\mathcal{F}$ and a
class of distributions $\mathcal{D}$ over $X$. For $\gamma, \beta > 0$ let
$m = \mathrm{SD}(Z, \gamma, \beta)$. Then $\mathrm{SDA}(Z, 2\gamma) \ge \frac{m\gamma}{\beta - \gamma}$.

The next corollary shows a setting of the parameters that is useful for our applications in later
sections.

Corollary 3. Let $X$ be a domain and $Z$ be a search problem over a set of solutions $\mathcal{F}$ and a
class of distributions $\mathcal{D}$ over $X$. If for $m > 0$,
$\mathrm{SD}(Z, \gamma = \frac{m^{-2/3}}{2}, \beta = 1) \ge m$ then at least $m^{1/3}/2$ calls
of tolerance $m^{-1/3}$ to the STAT oracle are required to solve $Z$.
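The parameter setting can be checked by substituting into the bound $m(\tau^2 - \gamma)/(\beta - \gamma)$ of Corollary 1. The following is a sketch of that substitution with $\tau = m^{-1/3}$, $\gamma = m^{-2/3}/2$ and $\beta = 1$, assuming $m \ge 1$ so that the denominator lies in $(0, 1]$.

```latex
\frac{m(\tau^2 - \gamma)}{\beta - \gamma}
  = \frac{m\left(m^{-2/3} - \tfrac{1}{2}m^{-2/3}\right)}{1 - \tfrac{1}{2}m^{-2/3}}
  = \frac{\tfrac{1}{2}m^{1/3}}{1 - \tfrac{1}{2}m^{-2/3}}
  \ge \frac{m^{1/3}}{2}.
```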



3.2 Lower bounds for Honest Statistical Algorithms 

Next we address lower bounds for the HSTAT oracle. The quantity $1/\tau^2$ can be thought of as
representing the sample complexity of a statistical algorithm up to logarithmic factors. On one
hand, $q$ queries can be estimated to tolerance $\tau$ using $O(\log q/\tau^2)$ samples (with any
constant probability of success). On the other hand, $\Omega(1/\tau^2)$ samples are necessary to
estimate the expectation of a "not-too-biased" query (with expectation bounded away from 1 and $-1$
by a constant) with constant probability of success. Strongly "biased" queries (such as a function
which is identical to 1 on every sample) can be estimated with fewer samples to tolerance $\tau$ but
our lower bound on the required tolerance can also be proportionately strengthened for such queries.
We prove these points formally in our sample complexity lower bound for honest statistical
algorithms. In this section it will be more convenient for us to assume that query functions used by
the honest oracle are $\{0, 1\}$-valued instead of $\{-1, 1\}$-valued. This does not change the model in
any way since we can replace the value $-1$ with 0 in the query function and then replace 0 with
$-1$ in the response. We will need the following two lemmas before proving Theorem 2.

Lemma 1. For a query $h : X \to \{0, 1\}$ and $\tau = \sqrt{\gamma}$, let $A(h, \tau)$ be the set of
distributions $D_i$ in $\mathcal{D}_f$ such that
$|\mathbb{E}_D[h(x)] - \mathbb{E}_{D_i}[h(x)]| > \tau \sqrt{\mathbb{E}_D[h(x)]}$. Then

$$|A(h, \tau)| \le m/d,$$

where $D$, $\gamma$, $\mathcal{D}_f$, $m$ and $d$ are as defined in Theorem 1 and its proof.

Proof. In the proof of Theorem 1 we obtain that $|A_k| \le m/d$ whenever
$\|h_k\|^2 \cdot \rho(A_k, D) \ge \tau^2$. For $\tau = \sqrt{\gamma}$, we can also obtain the same
conclusion under the condition $\|h_k\|^2 \cdot \rho(A_k, D) \ge \tau^2 \cdot \|h_k\|^2$. In other words,
we can obtain that $|A_k| \le m/d$ also when $A_k$ is defined as the set of distributions $D_i$ such
that $|\mathbb{E}_D[h_k(x)] - \mathbb{E}_{D_i}[h_k(x)]| > \tau \cdot \|h_k\|$. We now observe that for a
$\{0, 1\}$ function $h_k$, $\|h_k\|^2 = \mathbb{E}_D[h_k(x)]$. This implies that, in the notation of our
lemma, $|A(h, \tau)| \le m/d$. □

Lemma 2. Let $X \sim B(1, p)$. Then, for any $p' \in (0, 1)$,

$$\mathbb{E}_X\left[ \frac{\Pr[B(1, p) \text{ generated } X]}{\Pr[B(1, p') \text{ generated } X]} \right]
  = 1 + \frac{(p - p')^2}{p'(1 - p')}.$$

Proof. If $X = 1$, the ratio is $p/p'$ and when $X = 0$, then it is $(1 - p)/(1 - p')$. Thus, the
expected ratio is

$$\frac{p^2}{p'} + \frac{(1 - p)^2}{1 - p'} = 1 + \frac{(p - p')^2}{p'(1 - p')}.$$

□
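The identity in Lemma 2 can be checked with exact rational arithmetic. The sketch below is only a numerical sanity check, not part of the proof; the function names are hypothetical.

```python
from fractions import Fraction


def expected_likelihood_ratio(p, p_prime):
    """E_{X ~ B(1,p)} of Pr[B(1,p) generated X] / Pr[B(1,p') generated X]."""
    ratio_if_one = p / p_prime                 # X = 1, which has probability p
    ratio_if_zero = (1 - p) / (1 - p_prime)    # X = 0, which has probability 1 - p
    return p * ratio_if_one + (1 - p) * ratio_if_zero


def closed_form(p, p_prime):
    """Right-hand side of Lemma 2: 1 + (p - p')^2 / (p' (1 - p'))."""
    return 1 + (p - p_prime) ** 2 / (p_prime * (1 - p_prime))
```

Passing `Fraction` values avoids floating-point rounding, so the two functions can be compared with exact equality.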

We are now ready for the proof of the main lower bound. 

Proof of Theorem 2. Our generative model for HSTAT's interaction with an algorithm is as follows:
HSTAT picks as the target $D$ with probability 1/2 and with probability 1/2 picks a $D_i$ uniformly
at random. Denote this random variable $D$. Upon a query of $h_j$, HSTAT draws a sample $x_j$ from
$D$, and responds with $h_j(x_j)$. After $q$ rounds, the algorithm outputs its best guess of $D$.
Because $D$ is drawn randomly, it makes sense to talk about the algorithm's success probability with
respect to the randomness of $D$ and $x_j$.

An equivalent model is as follows: there is some joint distribution over HSTAT's random target $\mathbf{D}$ and the possible
responses of the HSTAT oracle. HSTAT will not choose $\mathbf{D}$ first, but will answer queries according
to their marginal distributions: when the algorithm presents query $h_1$, HSTAT returns an answer
chosen according to the marginal distribution of $h_1(x_1)$ (obtained by integrating out the $\mathbf{D}$ variable).
Subsequently, when the algorithm asks query $h_j$, HSTAT responds according to the marginal
distribution of $h_j(x_j)$ conditioned on the previous responses $h_1(x_1), \ldots, h_{j-1}(x_{j-1})$. After the $q$-th
query HSTAT will pick $\mathbf{D}$ from the marginal conditioned on $h_1(x_1), \ldots, h_q(x_q)$ and the algorithm
will output a guess conditioned on $h_1(x_1), \ldots, h_q(x_q)$. It is clear that this is equivalent to the first
model, but it captures the sources of randomness and available information much better. We call
this the joint model, and will use it to prove our honest statistical algorithm lower bound.

Denote the result of the first $j$ queries by $\omega_j = (h_1(x_1), \ldots, h_j(x_j))$, and let $B$ denote an
algorithm which outputs a guess based on $\omega_q$. Writing $\mathbf{D}$ for HSTAT's random target, the goal is to maximize the probability that $B$'s output and
HSTAT's choice are the same:

\[ \max \Pr[B(\omega_q) = \mathbf{D} \mid \omega_q] \quad \text{s.t.} \quad \sum_{D_i} \Pr[B(\omega_q) = D_i \mid \omega_q] = 1. \]

We can rewrite the objective function as follows, since $B$ is adapted to $\omega_q$ and is independent of $\mathbf{D}$:

\[ \Pr[B(\omega_q) = \mathbf{D} \mid \omega_q] = \sum_{D_i} \Pr[B(\omega_q) = D_i \mid \omega_q] \Pr[\mathbf{D} = D_i \mid \omega_q]. \]

The optimal $B$ is deterministic and picks the $D_i$ with greatest conditional probability. By
construction, $B$ has this quantity as its success probability. Since the algorithm can do no better than
picking maximum conditional probabilities as its output, we will assume that it in fact does so.
Clearly, making the algorithm more powerful still preserves any lower bounds.

We will analyze the conditional probability of $D$ and show that this quantity never exceeds 7/8.
The conditional probabilities can be rewritten by Bayes' rule:

\[ \Pr[D_i \mid h_1(x_1), \ldots, h_q(x_q)] = \frac{\Pr[h_1(x_1), \ldots, h_q(x_q) \mid D_i] \Pr[D_i]}{\Pr[h_1(x_1), \ldots, h_q(x_q)]}. \]

Since the queries are adaptive, we define a random variable $H_j$ for the choice of the $j$-th query. We
can then expand the conditional probability term:

\[ \Pr[h_1(x_1), \ldots, h_q(x_q) \mid D_i] = \prod_{j=1}^{q} \Pr[H_j = h_j \mid D_i, \omega_{j-1}, H_1, \ldots, H_{j-1}] \cdot \Pr[h_j(x_j) \mid D_i, \omega_{j-1}, H_1, \ldots, H_j]. \]

The $H_j$ random variables and $\Pr[h_1(x_1), \ldots, h_q(x_q)]$ are the same for each $D_i$, so we suppress these
as a constant $c$. The $h_j(x_j)$ are conditionally independent when $D_i$ is fixed. In this case, each $h_j(x_j)$
is a Bernoulli random variable with bias $p_j^i$:

\[ \Pr[h_j(x_j) \mid D_i] = (p_j^i)^{h_j(x_j)} (1 - p_j^i)^{1 - h_j(x_j)}. \]

Therefore, the conditional probability is given by:

\[ \Pr[D_i \mid h_1(x_1), \ldots, h_q(x_q)] = c \Pr[D_i] \prod_{j=1}^{q} (p_j^i)^{h_j(x_j)} (1 - p_j^i)^{1 - h_j(x_j)}. \]

Let $\tau = \sqrt{\gamma}$. Using Lemma 1 we can bound the size of $A(h_j,\tau)$, which consists of the $D_i$'s whose
$p_j^i$ are substantially different from that of $D$ (which we shall denote by $p_j$). The number of $D_i$'s in
the union of the $A(h_j,\tau)$ is at most $qm/d$. Thus, with $q \le d/100$, there are at least $99m/100$ such $D_i$'s
remaining.

For the remaining $D_i$'s, we know that $|p_j^i - p_j| \le \tau\sqrt{E_D[h_j]} = \tau\sqrt{p_j}$. We can always assume
that $p_j \le 1/2$, since any query $h$ such that $E_D[h] > 1/2$ can be replaced with query $1 - h$ and
the response then flipped by the algorithm. This implies that $|p_j^i - p_j| \le \tau\sqrt{p_j} \le \tau\sqrt{2p_j(1 - p_j)}$.
For every query $j$, we can now bound in expectation the increase in conditional probability using
Lemma 2. The ratios change by at most

\[ 1 + \frac{(p_j^i - p_j)^2}{p_j(1-p_j)} \le 1 + \frac{2\tau^2 p_j(1-p_j)}{p_j(1-p_j)} = 1 + 2\gamma \]

in any round (in expectation). After $q$ queries, the expected ratio is at most

\[ (1 + 2\gamma)^q \le 1.5 \]

for $q \le 1/(8\gamma)$. We can obtain concentration by using Markov's inequality. Hence a successful algorithm needs $q > 1/(8\gamma)$. In
particular, in relative terms, the conditional probability of $D$ increases by a factor of at most 1.5.
If we compare the conditional probability of $D$ with the total conditional probability
across all the other $D_i$, we obtain a comparison between $\Pr[D \mid h_1(x_1), \ldots, h_q(x_q)] \le \frac{3}{4}c$ and
$\sum_{D_i \notin A} \Pr[D_i \mid h_1(x_1), \ldots, h_q(x_q)] \ge \frac{99}{200}c$, which yields that the conditional probability of $D$ is
strictly less than 7/8. Let $A$ denote the algorithm's output and $\mathbf{D}$ the random target chosen by HSTAT. We have the following bounds:

\[ \Pr[A = D \wedge \mathbf{D} \ne D] + \Pr[A \ne D \wedge \mathbf{D} \ne D] = 1/2, \]

\[ \Pr[A = D \wedge \mathbf{D} = D] \le 1/2, \]

\[ \Pr[A = D \wedge \mathbf{D} = D] - 7\Pr[A = D \wedge \mathbf{D} \ne D] \le 0. \]

By taking a linear combination of these constraints in the ratio $(1, 6/7, 1/7)$, we obtain the bound

\[ \Pr[A = D \wedge \mathbf{D} = D] + \Pr[A \ne D \wedge \mathbf{D} \ne D] \le \frac{13}{14}, \]

so the success probability of the algorithm is bounded by 13/14.
Thus,

\[ q \ge \min\left\{ \frac{1}{8\gamma}, \frac{d}{100} \right\}. \]

□



We conclude this section with an application of Corollary [2] to obtain a version of Theorem [2] 
for the simpler (pairwise) version of statistical dimension. 

Corollary 4. Let $X$ be a domain and $Z$ be a search problem over a set of solutions $\mathcal{F}$ and a class
of distributions $\mathcal{D}$ over $X$. For $\gamma, \beta > 0$ let $m = SD(Z, \gamma, \beta)$. Any deterministic honest statistical
algorithm requires at least

\[ \min\left\{ \frac{1}{16\gamma}, \frac{1}{60}\sqrt{\frac{m}{\beta}} \right\} \]

samples from the HSTAT oracle to solve $Z$.
Proof. We have 



max mm 



1 1 m7o 



7o>7 V 1 1670 100 j3 — 70 
Solving for $\gamma_0$ and substituting, we get our bound. □

3.3 Reductions between STAT and HSTAT

We now show that access to the honest statistical oracle is essentially equivalent to access to STAT.
It has been observed in the context of learning [Yang, 2001] that, given a boolean query function $h$,
one can obtain an estimate of $E_D[h]$ using $t = O(\log(1/\delta)/\tau^2)$ honest samples which with probability
at least $1 - \delta$ will be within $\tau$ of $E_D[h]$. We also allow real-valued query functions in our model,
but any such query function can be replaced by $\lceil \log(1/\tau) \rceil + 2$ boolean queries, each of tolerance
$\tau/2$. Query $i$ computes bit $i$ of $1 + h(x) \in [0,2]$, so only $\lceil \log(1/\tau) \rceil + 2$ bits are necessary to get
the value of $h(x)$ within $\tau/2$. Combining these two observations gives us the following theorem.
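
The first observation can be illustrated with a short sketch. It assumes a $\{0,1\}$-valued query and uses the standard Hoeffding bound to fix the constant hidden in $O(\log(1/\delta)/\tau^2)$; the names `samples_needed`, `estimate_expectation` and `draw_response` are hypothetical and not taken from the paper.

```python
import math
import random

def samples_needed(tau, delta):
    """Hoeffding bound for a {0,1}-valued query: n >= ln(2/delta) / (2 * tau^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * tau ** 2))

def estimate_expectation(draw_response, tau, delta):
    """Average fresh honest responses h(x) in {0,1}.

    draw_response() is assumed to return h(x) for a fresh sample x ~ D.
    The average is within tau of E_D[h] with probability at least 1 - delta.
    """
    n = samples_needed(tau, delta)
    return sum(draw_response() for _ in range(n)) / n
```

For example, with `rng = random.Random(0)` one could call `estimate_expectation(lambda: 1 if rng.random() < 0.3 else 0, 0.05, 0.01)` to estimate a bias of 0.3 to within 0.05.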

Theorem 8. Let $Z$ be a search problem and let $A$ be a statistical algorithm that solves $Z$ using $q$
queries of tolerance $\tau$. For any $\delta > 0$, there exists an honest statistical algorithm $A'$ that uses at
most $O(q \log(q/(\delta\tau))/\tau^2)$ samples and solves $Z$ with probability at least $1 - \delta$.

We also show a reduction in the other direction, namely that the STAT oracle can be used to 
simulate the HSTAT oracle. 

Theorem 9. Let $Z$ be a search problem and let $A$ be an honest statistical algorithm that solves
$Z$ with probability at least $\delta$ using $q$ samples from HSTAT. For any $\delta'$ there exists a statistical
algorithm $A'$ that uses at most $q$ queries of tolerance $2\delta'/q$ and solves $Z$ with probability at least
$\delta(1 - \delta')$.

Proof. $A'$ simulates $A$ as follows. Let $h_1 : X \to \{-1, 1\}$ be the first query of $A$ and let $p =
E_{x \sim D}[h_1(x)]$. By asking the query $STAT_D(h_1, \tau)$ for $\tau = 2\delta'/q$, we can get a value $p' \in [p - \tau, p + \tau]$.
We flip a $\pm 1$ coin with bias $p'$ (that is, one that outputs 1 with probability $(1 + p')/2$ and $-1$ with
probability $(1 - p')/2$). We return the outcome to $A$. One can think of the coin flip with bias $p'$ as
the coin flip with bias $p$ followed by a correction with probability $|p' - p|/2$. Namely, if $p' > p$ then
$-1$ is output with probability $(p' - p)/2$ and otherwise 1 is output with probability $(p - p')/2$. This
implies that our simulation can be seen as an honest simulation with a random correction step that
happens with probability at most $|p - p'|/2 \le \tau/2 = \delta'/q$.

We continue the simulation of the rest of $A$'s queries analogously. By the union bound, the
probability of a correction step happening during the simulation (and hence of our simulation
differing from the honest one) is at most $\delta'$, independently of other random events. Therefore $A'$ is
successful with probability at least $\delta(1 - \delta')$. □
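
The coin-flipping step of this simulation is simple enough to sketch directly. The code below is an illustration only: `simulated_honest_response` and `simulate_queries` are hypothetical names, and the list `stat_answers` stands in for the values $p'$ returned by a STAT oracle.

```python
import random

def simulated_honest_response(p_prime, rng):
    """Return +1 with probability (1 + p')/2 and -1 otherwise."""
    return 1 if rng.random() < (1.0 + p_prime) / 2.0 else -1

def simulate_queries(stat_answers, rng):
    """Turn a list of STAT answers p' (one per query) into simulated HSTAT responses."""
    return [simulated_honest_response(p_prime, rng) for p_prime in stat_answers]
```

Each simulated response differs in distribution from a true honest response by at most $|p - p'|/2$ in total variation, which is the correction probability used in the proof.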



4 Planted Clique 

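We now prove the lower bound claimed in Theorem 3 on the problem of determining whether the
given distribution on vectors from $\{0,1\}^n$ is just uniform or from a planted $k$-clique distribution as
defined above.

For a subset $S \subseteq [n]$, let $D_S$ be the distribution with a planted clique on the subset $S$. Let
$\{S_1, \ldots, S_m\}$ be the set of all $\binom{n}{k}$ subsets of $[n]$ of size $k$. For $i \in [m]$ we use $D_i$ to denote $D_{S_i}$.
The reference distribution $D$ in our lower bounds will be the uniform distribution over $\{0,1\}^n$, and we let
$\hat{D}_S$ denote $D_S/D - 1$. In order to apply our lower bounds based on statistical dimension with average
correlation, we now prove that for the planted clique problem average correlations of large sets must
be small. We start with a lemma that bounds the correlation of two planted clique distributions
relative to the reference distribution $D$ as a function of the overlap between the cliques.

Lemma 3. For $i, j \in [m]$,

\[ \langle \hat{D}_i, \hat{D}_j \rangle_D \le \frac{2^{\lambda} k^2}{n^2}, \]

where $\lambda = |S_i \cap S_j|$.

Proof. For the distribution $D_i$, we consider the probability $D_i(x)$ of generating the vector $x$. Then,

\[ D_i(x) = \begin{cases} \left(\frac{n-k}{n}\right)\frac{1}{2^n} + \left(\frac{k}{n}\right)\frac{1}{2^{n-k}} & \text{if } \forall s \in S_i,\ x_s = 1, \\ \left(\frac{n-k}{n}\right)\frac{1}{2^n} & \text{otherwise.} \end{cases} \]

Now we compute the vector $\hat{D}_i = D_i/D - 1$:

\[ \hat{D}_i(x) = \begin{cases} \frac{k 2^k}{n} - \frac{k}{n} & \text{if } \forall s \in S_i,\ x_s = 1, \\ -\frac{k}{n} & \text{otherwise.} \end{cases} \]

We then bound $\langle \hat{D}_i, \hat{D}_j \rangle_D$. Under $D$, both planted conditions hold on a $2^{-(2k-\lambda)}$ fraction of the points and each one alone holds on a $2^{-k}$ fraction, so

\[ \langle \hat{D}_i, \hat{D}_j \rangle_D = \frac{k^2}{n^2}\left( 2^{2k} \cdot 2^{-(2k-\lambda)} - 2 \cdot 2^{k} \cdot 2^{-k} + 1 \right) = \frac{(2^{\lambda} - 1) k^2}{n^2} \le \frac{2^{\lambda} k^2}{n^2}. \]

□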
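
For small parameters the correlation in Lemma 3 can be computed by brute force. The sketch below enumerates $\{0,1\}^n$ and uses the planted clique distribution described above (with probability $k/n$ the coordinates in $S$ are set to 1, the rest uniform); the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def planted_clique_prob(x, S, n):
    """D_S(x): with probability k/n the coordinates in S are forced to 1, the rest uniform."""
    k = len(S)
    uniform = Fraction(1, 2 ** n)
    planted = Fraction(1, 2 ** (n - k)) if all(x[s] == 1 for s in S) else Fraction(0)
    return (1 - Fraction(k, n)) * uniform + Fraction(k, n) * planted

def correlation(S_i, S_j, n):
    """<D_i/D - 1, D_j/D - 1>_D for the uniform reference distribution D on {0,1}^n."""
    uniform = Fraction(1, 2 ** n)
    total = Fraction(0)
    for x in product((0, 1), repeat=n):
        d_i = planted_clique_prob(x, S_i, n) / uniform - 1
        d_j = planted_clique_prob(x, S_j, n) / uniform - 1
        total += uniform * d_i * d_j
    return total
```

With $n = 6$, $k = 3$ and overlap $\lambda = 2$ this evaluates to $(2^2 - 1) \cdot 9/36 = 3/4$, below the bound $2^{\lambda}k^2/n^2 = 1$.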

We now give a bound on the average correlation of any $\hat{D}_S$ with a large number of distinct
clique distributions.

Lemma 4. For $\kappa < 1/2$ and $k \le n^{\kappa}$, let $\{S_1, \ldots, S_m\}$ be the set of all $\binom{n}{k}$ subsets of $[n]$ of size $k$
and $\{D_1, \ldots, D_m\}$ be the corresponding distributions on $\{0,1\}^n$. Then for any integer $\ell \le k$, set $S$
of size $k$ and subset $A \subseteq \{S_1, \ldots, S_m\}$ where $|A| \ge 4(m-1)/n^{\ell(1-2\kappa)}$,

\[ \frac{1}{|A|} \sum_{S_i \in A} \langle \hat{D}_S, \hat{D}_i \rangle < 2^{\ell+2} \frac{k^2}{n^2}. \]

Proof. In this proof we first show that if the total number of sets in $A$ is large then most of the sets
in $A$ have a small overlap with $S$. We then use the bound on the overlap of most sets to obtain a
bound on the average correlation of $\hat{D}_S$ with distributions for sets in $A$.

Formally, we let $\alpha = k^2/n^2$ and using Lemma 3 get the bound $\langle \hat{D}_i, \hat{D}_j \rangle \le 2^{|S_i \cap S_j|}\alpha$. Summing over
$S_i \in A$,

\[ \sum_{S_i \in A} \langle \hat{D}_S, \hat{D}_i \rangle \le \sum_{S_i \in A} 2^{|S \cap S_i|} \alpha. \]

For any set $A \subseteq \{S_1, \ldots, S_m\}$ of size $t$ this bound is maximized when the sets of $A$ include $S$, then
all sets that intersect $S$ in $k - 1$ indices, then all sets that intersect $S$ in $k - 2$ indices and so on
until the size bound $t$ is exhausted. We can therefore assume without loss of generality that $A$ is
defined in precisely this way.

Let

\[ T_\lambda = \{ S_i \mid |S \cap S_i| = \lambda \} \]

denote the subset of all $k$-subsets that intersect with $S$ in exactly $\lambda$ indices. Let $\lambda_0$ be the smallest
$\lambda$ for which $A \cap T_\lambda$ is non-empty. We first observe that for any $1 \le j \le k - 1$,

\[ \frac{|T_j|}{|T_{j+1}|} = \frac{\binom{k}{j}\binom{n-k}{k-j}}{\binom{k}{j+1}\binom{n-k}{k-j-1}} = \frac{(j+1)(n - 2k + j + 1)}{(k-j)^2} \ge \frac{(j+1)(n - 2k)}{k^2} \ge \frac{(j+1) n^{1-2\kappa}}{2}. \]

By applying this equation inductively we obtain

\[ |T_j| \le \frac{2^j \cdot |T_0|}{j! \cdot n^{(1-2\kappa)j}} \le \frac{2^j \cdot (m-1)}{j! \cdot n^{(1-2\kappa)j}} \]

and

\[ \sum_{k \ge \lambda \ge j} |T_\lambda| \le \sum_{k \ge \lambda \ge j} \frac{2^\lambda \cdot (m-1)}{\lambda! \cdot n^{(1-2\kappa)\lambda}} \le \frac{4(m-1)}{n^{(1-2\kappa)j}}. \]

By definition of $\lambda_0$, $|A| \le \sum_{j \ge \lambda_0} |T_j| \le 4(m-1)/n^{(1-2\kappa)\lambda_0}$. In particular, if $|A| \ge 4(m-1)/n^{\ell(1-2\kappa)}$
then $n^{(1-2\kappa)\lambda_0}/4 \le n^{\ell(1-2\kappa)}/4$, or $\lambda_0 \le \ell$. Now we can conclude that

\[ \sum_{S_i \in A} \langle \hat{D}_S, \hat{D}_i \rangle \le \sum_{j=\lambda_0}^{k} 2^j |T_j \cap A| \alpha \]
\[ \le \Big( 2^{\lambda_0} |T_{\lambda_0} \cap A| + \sum_{j=\lambda_0+1}^{k} 2^j |T_j| \Big) \alpha \]
\[ \le \big( 2^{\lambda_0} |T_{\lambda_0} \cap A| + 2 \cdot 2^{\lambda_0+1} |T_{\lambda_0+1}| \big) \alpha \]
\[ \le 2^{\lambda_0+2} |A| \alpha \le 2^{\ell+2} |A| \alpha. \]

To derive the last inequality we need to note that for every $j \ge 0$, $2^j |T_j| > 2(2^{j+1}|T_{j+1}|)$; we can
therefore telescope the sum. □
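
The counting identity behind the ratio $|T_j|/|T_{j+1}|$ can be sanity-checked numerically. This is a small sketch under the stated definition of $T_j$; `overlap_class_size` is a hypothetical helper name.

```python
from math import comb

def overlap_class_size(n, k, j):
    """|T_j|: number of k-subsets of [n] meeting a fixed k-subset S in exactly j elements."""
    # Choose the j shared elements inside S and the remaining k - j outside S.
    return comb(k, j) * comb(n - k, k - j)
```

Cross-multiplying, the ratio formula says $|T_j| \cdot (k-j)^2 = |T_{j+1}| \cdot (j+1)(n-2k+j+1)$, and the classes $T_0, \ldots, T_k$ together account for all $\binom{n}{k}$ subsets.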

Lemma 4 gives a simple way to bound the statistical dimension with average correlation of the
planted bipartite $k$-clique problem.

Theorem 10. For $\kappa < 1/2$ and $k \le n^{\kappa}$ let $Z$ be the planted bipartite $k$-clique problem. Then for any
$\ell \le k$, $SDA(Z, 2^{\ell+2}k^2/n^2) \ge n^{\ell(1-2\kappa)}/4$.

Proof. Let $\{S_1, \ldots, S_m\}$ be the set of all $\binom{n}{k}$ subsets of $[n]$ of size $k$ and $\mathcal{D} = \{D_1, \ldots, D_m\}$ be
the corresponding distributions on $\{0,1\}^n$. For every solution $S \in \mathcal{F}$, $Z^{-1}(S) = D_S$ and let
$\mathcal{D}_S = \mathcal{D} \setminus \{D_S\}$. Note that $|\mathcal{D}_S| = m - 1$.

Let $\mathcal{D}'$ be a set of distributions $\mathcal{D}' \subseteq \mathcal{D}_S$ such that $|\mathcal{D}'| \ge 4(m-1)/n^{\ell(1-2\kappa)}$. Then by Lemma
4, for every $D_i \in \mathcal{D}'$,

\[ \frac{1}{|\mathcal{D}'|} \sum_{D_j \in \mathcal{D}'} \langle \hat{D}_i, \hat{D}_j \rangle < 2^{\ell+2} \frac{k^2}{n^2}. \]

In particular, $\rho(\mathcal{D}', D) < 2^{\ell+2} k^2/n^2$. By the definition of SDA, this means that $SDA(Z, 2^{\ell+2}k^2/n^2) \ge
n^{\ell(1-2\kappa)}/4$. □

Theorems 1 and 10 imply the following corollary, as well as Theorem 3.

Corollary 5. For any $\kappa < 1/2$, $k \le n^{\kappa}$ and any $\ell \le k$, at least $n^{\ell(1-2\kappa)}/4$ queries of tolerance
$\tau = 2^{\ell/2+1} k/n$ are required to solve the planted bipartite $k$-clique problem. In particular, for any
constant $\kappa$ and $\ell = \log\log n$ we obtain that $n^{\Omega(\log\log n)}$ queries of tolerance $\tau = \Omega(k/n)$ are required.

Theorems 2 and 10 also imply the sample complexity lower bound stated in Theorem 4.

4.1 Planted Densest Subgraph

We will now show the lower bound on detecting a planted densest subset, a generalization of the 
planted clique problem. 

Problem 5 (planted bipartite densest $k$-subgraph). For $1 \le k \le n$, let $S \subseteq \{1, 2, \ldots, n\}$ be a set
of $k$ vertex indices and $D_S$ be a distribution over $\{0,1\}^n$ such that when $x \sim D_S$, with probability
$1 - (k/n)$ the entries of $x$ are chosen uniformly and independently from $\{0,1\}$, and with probability
$k/n$ the $k$ coordinates in $S$ are each, independently, set to 1 with probability $q > 1/2$ and the rest
are chosen uniformly and independently from $\{0,1\}$. The planted bipartite densest $k$-subgraph
problem is to find the unknown subset $S$ given access to samples from $D_S$.
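
The sampling process in Problem 5 can be written out as a short sketch. The function name `sample_planted_dense` and the 0-based indexing of $S$ are assumptions made for illustration; planted coordinates that are not set to 1 are taken to be 0.

```python
import random

def sample_planted_dense(n, S, q, rng):
    """Draw one x in {0,1}^n from D_S as described in Problem 5 (S uses 0-based indices)."""
    x = [rng.randint(0, 1) for _ in range(n)]      # uniform background
    if rng.random() < len(S) / n:                   # plant with probability k/n
        for s in S:
            x[s] = 1 if rng.random() < q else 0     # each planted coordinate is 1 w.p. q
    return x
```

Setting `q = 1.0` forces all coordinates in $S$ to 1 whenever the plant occurs.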

We note that when $q = 1$ this is equivalent to the planted clique problem. For this problem, we
are able to prove the following bound.

Lemma 5. Let $\{S_1, \ldots, S_m\}$ be the set of all $\binom{n}{k}$ subsets of $[n]$ of size $k$, for $k \le n^{\kappa}$ with $\kappa < 1/2$
and $\ell \le k$, with associated planted densest subgraph distributions $\{D_1, \ldots, D_m\}$. Then for any set
$S$ of size $k$ and subset $A \subseteq \{S_1, \ldots, S_m\}$ where $|A| \ge m/d$,

\[ \frac{1}{|A|} \sum_{S_i \in A} \langle \hat{D}_S, \hat{D}_i \rangle < 8 \left( \big(2(q^2 + (1-q)^2)\big)^{\frac{\log(2m/d)}{(1-2\kappa)\log n}} - 1 \right) \frac{k^2}{n^2}. \]

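Proof. Our planted sets on $n$ coordinates will be of size $k$, with pairwise overlap $\lambda$ as before. The
difference is to consider the probability $q$ (as opposed to 1) of edges appearing in the plant in the
calculation. We define

\[ \xi_S(x, q) = q^{|x_S|}(1-q)^{k - |x_S|}, \]

where $|x_S|$ denotes the number of coordinates of $x$ in $S$ that equal 1, and consider

\[ D_S(x) = \left(1 - \frac{k}{n}\right)\frac{1}{2^n} + \frac{k}{n} \cdot \frac{\xi_S(x,q)}{2^{n-k}}. \]

Then the quantity

\[ \frac{D_S(x)}{D(x)} - 1 = \big(2^k \xi_S(x,q) - 1\big)\frac{k}{n}. \]

We need to compute

\[ \langle \hat{D}_S, \hat{D}_i \rangle_D = \frac{1}{2^n}\left(\frac{k}{n}\right)^2 \sum_{x \in \{0,1\}^n} \big( 2^{2k}\xi_S(x,q)\xi_{S_i}(x,q) - 2 \cdot 2^k \xi_S(x,q) + 1 \big) \]
\[ = \frac{1}{2^n}\left(\frac{k}{n}\right)^2 \Big( 2^{2k} \sum_{x \in \{0,1\}^n} \xi_S(x,q)\xi_{S_i}(x,q) - 2 \cdot 2^k 2^{n-k} \sum_{x \in \{0,1\}^k} \xi_S(x,q) + 2^n \Big) \]
\[ = \frac{1}{2^n}\left(\frac{k}{n}\right)^2 \big( 2^{n+\lambda}(q^2 + (1-q)^2)^{\lambda} - 2^n \big) \]
\[ = \frac{k^2}{n^2} 2^{\lambda} (q^2 + (1-q)^2)^{\lambda} - \frac{k^2}{n^2}. \]

The rest of the proof proceeds as in Lemma 4 except that, with the same choice of $\lambda_0$ and $\alpha = k^2/n^2$, we obtain

\[ \sum_{S_i \in A} \langle \hat{D}_S, \hat{D}_i \rangle \le \sum_{j=\lambda_0}^{k} \alpha \big( 2^j (q^2 + (1-q)^2)^j - 1 \big) |T_j \cap A| \]
\[ \le \Big( |T_{\lambda_0} \cap A| \big( 2^{\lambda_0}(q^2 + (1-q)^2)^{\lambda_0} - 1 \big) + \sum_{j=\lambda_0+1}^{k} 2^j |T_j| (q^2 + (1-q)^2)^j - \sum_{j=\lambda_0+1}^{k} |T_j| \Big) \alpha \]
\[ \le \Big( 2^{\lambda_0} |T_{\lambda_0} \cap A| \big( (q^2 + (1-q)^2)^{\lambda_0} - 1 \big) + 2 \cdot 2^{\lambda_0+1} |T_{\lambda_0+1}| (q^2 + (1-q)^2)^{\lambda_0+1} - \sum_{j=\lambda_0+1}^{k} |T_j| \Big) \alpha \]
\[ \le 8 \big( 2^{\lambda_0} (q^2 + (1-q)^2)^{\lambda_0} - 1 \big) |A| \alpha. \]

□
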

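Theorem 11. For $\kappa < 1/2$ and $k \le n^{\kappa}$ let $Z$ be the planted bipartite densest subgraph problem. Then
for any $\ell \le k$, $q > 1/2$,

\[ SDA\left( Z,\ 8 \left( \big(2(q^2 + (1-q)^2)\big)^{\frac{\log(2m/d)}{(1-2\kappa)\log n}} - 1 \right) \frac{k^2}{n^2} \right) \ge n^{\ell(1-2\kappa)}/4. \]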

With appropriate choices of parameter settings, we get the following corollary.

Corollary 6. For constants $c, \delta > 0$, density $q \le 1/2 + 1/n^c$, and $k \le n^{1/2-\delta}$, any honest statistical
algorithm requires $\Omega(n^{2+c}/k^2)$ samples to find a planted densest subgraph of size $k$.

5 Other applications of Statistical Dimension

In this section, we use Definition 5 together with the bound in Corollary 1 to get unconditional
lower bounds for a variety of optimization problems. A recurring concept in our constructions will
be a parity function, $\chi$. We first explore some properties of parity functions.

Definition 6 (parity). For $x \in \{0,1\}^n$ and $c \in \{0,1\}^n$, let $\chi_c : \{0,1\}^n \to \{-1, 1\}$ be defined by

\[ \chi_c(x) = -(-1)^{c \cdot x}. \]

Namely, $\chi_c(x) = 1$ if $c \cdot x$ is odd, and $-1$ otherwise.

Note: for convenience$^2$, we will sometimes use $x \in \{\pm 1\}^n$, in which case we abuse notation
and define $\chi_c(x) = -\prod_{i : c_i = 1} x_i$. This corresponds to the embedding of $x$ from $\{0,1\} \to \{-1,1\}$ of
$0 \to 1$, $1 \to -1$.

Further, we define distributions uniform over the examples classified positive by a parity. 

Definition 7 (distributions $D_c$). Let $x \in \{\pm 1\}^n$ and $c \in \{0,1\}^n$ and let $S_c = \{x \mid \chi_c(x) = 1\}$. We
define $D_c$ to be the uniform distribution over $S_c$.

Lemma 6. For $c \in \{0,1\}^n$, $c \ne 0$ and the uniform distribution $U$ over $\{-1,1\}^n$, the following
hold:

\[ 1)\ E_{x \sim D_c}[\chi_{c'}(x)] = \begin{cases} 1 & \text{if } c = c', \\ 0 & \text{otherwise;} \end{cases} \qquad 2)\ E_{x \sim U}[\chi_c(x)\chi_{c'}(x)] = \begin{cases} 1 & \text{if } c = c', \\ 0 & \text{otherwise.} \end{cases} \]

Proof. To show Part 1) note that if $c = c'$ then $E_{x \in S_c}[\chi_c(x)] = 1$. If $c \ne c'$ then it is easy
to see that $|S_c \cap S_{c'}| = |S_c|/2 = |S_{c'}|/2$ and so $E_{x \in S_c}[\chi_{c'}(x)] = \frac{1}{|S_c|}\big( \sum_{x \in S_c \cap S_{c'}} 1 + \sum_{x \in S_c \setminus S_{c'}} (-1) \big) = 0$.
Part 2) states the well-known fact that the parity functions are uncorrelated relative to the uniform
distribution. □
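
Both parts of Lemma 6 can be checked by enumeration for small $n$. The sketch below uses the $\{\pm 1\}$ form of the parity, $\chi_c(x) = -\prod_{i : c_i = 1} x_i$, and restricts to nonzero $c$ and $c'$; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def chi(c, x):
    """Parity on x in {-1,+1}^n: chi_c(x) = -prod_{i : c_i = 1} x_i."""
    value = -1
    for c_i, x_i in zip(c, x):
        if c_i == 1:
            value *= x_i
    return value

def expectation_under_D_c(c, c_prime, n):
    """E_{x ~ D_c}[chi_{c'}(x)], where D_c is uniform on {x : chi_c(x) = 1}."""
    support = [x for x in product((-1, 1), repeat=n) if chi(c, x) == 1]
    return Fraction(sum(chi(c_prime, x) for x in support), len(support))

def uniform_correlation(c, c_prime, n):
    """E_{x ~ U}[chi_c(x) chi_{c'}(x)] for the uniform distribution U on {-1,+1}^n."""
    points = list(product((-1, 1), repeat=n))
    return Fraction(sum(chi(c, x) * chi(c_prime, x) for x in points), len(points))
```

For every pair of nonzero $c, c' \in \{0,1\}^3$ both quantities equal 1 when $c = c'$ and 0 otherwise.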

These two facts will imply that when $D = U$ (the uniform distribution) and the $D_i$'s consist of
the $D_c$'s, we can set $\gamma = 0$ and $\beta = 1$ when considering the statistical dimension of the problems
presented in the following sections.



5.1 MAX-XOR-SAT 

We first formalize the MAX-XOR-SAT problem introduced in Problem 3. Let $D$ be a distribution
over XOR clauses $c \in \{0,1\}^n$. We interpret $c_i = 1$ as variable $i$ appearing in $c$ and otherwise
not; for simplicity, no variables are negated in the clauses. The problem is to find an assignment
$x \in \{0,1\}^n$ that maximizes the expected number of satisfied XOR clauses. We now give the
statistical dimension of this problem, from which Theorem 6 follows.

Theorem 12. For the MAX-XOR-SAT problem, let $\mathcal{F} = \{\chi_x\}_{x \in \{0,1\}^n}$, let $\mathcal{D}$ be the set of all
distributions over clauses $c \in \{0,1\}^n$, and for any $\delta > 0$, let $Z$ be the problem of $(1 - \delta)$-optimizing
over $\mathcal{F}$ and $\mathcal{D}$. Then $SD(Z, 0, 1) \ge 2^n - 1$.

2 For the moment maximization problem, it is necessary for our argument that examples $x$ be in $\{-1,1\}^n$, whereas
for MAX-XOR-SAT, the argument is much cleaner when $x$ is in $\{0,1\}^n$. It is, therefore, natural to use the same
notation for the corresponding parity problems.

Proof. Maximizing the expected number of satisfied clauses is equivalent to maximizing the quantity

\[ \max_{x \in \{0,1\}^n} E_{c \sim D}[\chi_x(c)]. \]

This proof is a fairly direct application of Lemma 6 to the definition of statistical dimension.

For the conditions in Definition 5, for each of the $2^n$ possible assignments to $x$ let $D_x$ be
the uniform distribution over the clauses $c \in \{0,1\}^n$ such that $\chi_c(x) = 1$.

Because $\chi_c(x)$ is symmetric in $x$ and $c$, the conditions in Definition 5 with $\beta = 1$ and $\gamma = 0$,
which follow from Lemma 6, are satisfied for the $2^n$ distributions $D_x$, with $D = U$. Because
$\chi_c(x) = 1$ when assignment $x$ satisfies clause $c$ and $-1$ otherwise, we need to scale the approximation
term by 1/2 when measuring the fraction of satisfied clauses. □

Corollary 7. Any statistical algorithm for a MAX-XOR-SAT instance asymptotically requires
$2^{n/3}$ queries of tolerance $2^{-n/3}$ to find an assignment that approximates the maximum probability
of satisfying a clause drawn from an unknown distribution to less than an additive term of 1/2.

5.2 k-Clique

We first formalize the distributional $k$-clique problem. Let $D$ be a distribution over $X = \{0,1\}^{\binom{n}{2}}$,
corresponding to graphs $G$ on $n$ vertices. For $G \in X$, let

\[ I_S(G) = \begin{cases} 1 & \text{if } S \text{ induces a clique in } G, \\ 0 & \text{otherwise.} \end{cases} \]

The $k$-clique problem is to find a subset $S \subseteq V$ of size $k$ that maximizes $E_{G \sim D}[I_S(G)]$.

We now give the statistical dimension of distributional $k$-clique, from which Theorem 7 follows.

Theorem 13. For the distributional $k$-clique problem, let $\mathcal{F} = \{I_S\}_{|S|=k}$, let $\mathcal{D}$ be the set of
distributions over graphs on $n$ vertices, and for any $\delta > 0$, let $Z$ be the problem of $\big(2^{-\binom{k}{2}} - \delta\big)$-optimizing
over $\mathcal{F}$ and $\mathcal{D}$. Then $SD(Z, 0, 1) \ge \binom{n}{k} - 1$.

Proof. We shall compute the statistical dimension of distributional $k$-clique with $\epsilon = 2^{-\binom{k}{2}} - \delta$ (for
$\delta > 0$), $\gamma = 0$, and $\beta = 1$ and show it is $\binom{n}{k}$.

For any subset of edges $T \subseteq V \times V$, and graph $G \in X$, we can define the function

\[ \mathrm{parity}_T(G, k) = \begin{cases} 1 & \text{if } |E(G) \cap T| \text{ has the same parity as } \binom{k}{2}, \\ -1 & \text{otherwise.} \end{cases} \]

Note that $\mathrm{parity}_T(G, k) = (-1)^{|E(G) \cap T| + \binom{k}{2}}$.

As both $T$ and $G$ lie in $\{0,1\}^{\binom{n}{2}}$, note that $\mathrm{parity}_T(G, k)$ is simply $\chi_T(G)$ (or its negation,
depending on $k$). Let $T_1, \ldots, T_d$ be all the $\binom{n}{k}$ cliques on $k$ vertices. We generate the distributions
$D_1, \ldots, D_d$ so that $D_i$ is uniform on the graphs $G$ such that $|E(G) \cap T_i| \equiv \binom{k}{2} \bmod 2$. The
distribution $D$ is uniform over all graphs $G$. By Lemma 6, these choices justify $\beta = 1$, $\gamma = 0$.

We notice that the set of vertices of the clique $T_i$ maximizes $E_{D_i}[I_S(G)]$ while the set of edges
of the clique maximizes $E_{D_i}[\mathrm{parity}_T(G, k)]$; namely, we have that

\[ V(T_i) = \arg\max_{S \subseteq V : |S| = k} \Big( E_{G \sim D_i}[I_S(G)] \Big) = V\Big( \arg\max_{T \subseteq V \times V} \Big( E_{G \sim D_i}[\mathrm{parity}_T(G, k)] \Big) \Big). \]

By definition $E_{G \sim D_i}[\mathrm{parity}_T(G, k)] \le 1$, with equality iff $T = T_i$.

For $S_i = V(T_i)$ we have that $I_{S_i}(G) = 1$ iff $T_i$ is a clique in $G$. Since any setting of the edges not
in $T_i$ appears equiprobably under $D_i$ and since there are $2^{\binom{k}{2} - 1}$ possible settings for edges between
vertices in $V(T_i)$ occurring equiprobably in graphs from $D_i$, it follows that $E_{G \sim D_i}[I_{S_i}(G)] = 2^{-\binom{k}{2} + 1}$.

On the other hand, if $S_j \ne V(T_i)$ then all subsets of edges among the vertices of $S_j$ appear
equiprobably under $D_i$. Hence, for $j \ne i$, $E_{G \sim D_i}[I_{S_j}] = 2^{-\binom{k}{2}}$, as only 1 of every $2^{\binom{k}{2}}$ subgraphs on
$k$ vertices forms a clique. This allows us to set $\epsilon = 2^{-\binom{k}{2}} - \delta$ for any $\delta > 0$.

Because our distributions were generated by the $k$-vertex subsets, we have shown the statistical
dimension to be $\binom{n}{k} - 1$. □
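
The two expectations used in this proof can be confirmed by enumerating all graphs on a handful of vertices. The sketch below is illustrative only; `clique_expectation` is a hypothetical name and the enumeration is exponential in $\binom{n}{2}$, so it is meant for tiny $n$.

```python
from fractions import Fraction
from itertools import combinations, product

def clique_expectation(n, S_i, S_j):
    """E_{G ~ D_i}[I_{S_j}(G)], with D_i uniform on graphs G such that
    |E(G) ∩ T_i| has the same parity as C(k, 2), where T_i is the clique on S_i."""
    edges = list(combinations(range(n), 2))
    T_i = set(combinations(sorted(S_i), 2))
    T_j = set(combinations(sorted(S_j), 2))
    target_parity = len(T_i) % 2
    support = hits = 0
    for bits in product((0, 1), repeat=len(edges)):
        present = {e for e, b in zip(edges, bits) if b}
        if len(present & T_i) % 2 != target_parity:
            continue                      # G is outside the support of D_i
        support += 1
        hits += T_j <= present            # S_j induces a clique iff all its edges are present
    return Fraction(hits, support)
```

For $n = 4$ and $k = 3$ the planted triangle has expectation $2^{-\binom{3}{2}+1} = 1/4$, while any other triangle has expectation $2^{-\binom{3}{2}} = 1/8$.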

Corollary 8. Any statistical algorithm for a k-clique instance asymptotically requires $\binom{n}{k}^{1/3}$ queries
of tolerance $\binom{n}{k}^{-1/3}$ to find an assignment that approximates the maximum probability of satisfying
clause drawn from an unknown distribution to less than an additive term of $2^{-\binom{k}{2}}$.

5.3 Moment Maximization 

We recall the moment maximization problem. Let $D$ be a distribution over $\{-1,1\}^n$ and let $r \in \mathbb{Z}^+$.
The moment maximization problem is to find a unit vector $u$ that maximizes $E_{x \sim D}[(u \cdot x)^r]$.
Before going to the main proof, we need to prove a property of odd moments.

Lemma 7. Let $r \in \mathbb{Z}^+$ be odd and let $c \in \{0,1\}^n$. Let $D_c$ be the distribution uniform over
$x \in \{-1,1\}^n$ for which $\chi_c(x) = -1$. Then, $\forall u \in \mathbb{R}^n$,

\[ E_{x \sim D_c}[(x \cdot u)^r] = r! \prod_{i : c_i = 1} u_i. \]

Proof. From Lemma 8 we have that

\[ \forall u \in \mathbb{R}^n, \quad E_{x \sim D_c}[(x \cdot u)^r] = r! \prod_{i : c_i = 1} u_i + E_{x \in \{\pm 1\}^n}[(x \cdot u)^r]. \quad (3) \]

The lemma now follows since, when $r$ is odd,

\[ E_{x \in \{\pm 1\}^n}[(x \cdot u)^r] = E_{x \in \{\pm 1\}^n}[((-x) \cdot u)^r] = -E_{x \in \{\pm 1\}^n}[(x \cdot u)^r], \]

and hence this expectation equals 0.

□
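
Lemma 7 can be checked exactly on a small instance. The sketch below assumes, as in Lemma 9, that $c$ is an $r$-parity (it has exactly $r$ ones); the function name is hypothetical and rational arithmetic is used so that the comparison is exact.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def moment_under_D_c(c, u, r):
    """E_{x ~ D_c}[(x . u)^r], with D_c uniform on x in {-1,1}^n such that chi_c(x) = -1,
    where chi_c(x) = -prod_{i : c_i = 1} x_i."""
    n = len(c)
    total, count = Fraction(0), 0
    for x in product((-1, 1), repeat=n):
        parity_product = 1
        for c_i, x_i in zip(c, x):
            if c_i == 1:
                parity_product *= x_i
        if -parity_product != -1:          # keep only points with chi_c(x) = -1
            continue
        total += sum(Fraction(u_i) * x_i for u_i, x_i in zip(u, x)) ** r
        count += 1
    return total / count
```

With $c = (1,1,1,0)$, $r = 3$ and $u = (1/2, 1/3, 2, 5)$ the result is $3! \cdot \tfrac12 \cdot \tfrac13 \cdot 2 = 2$, independent of the coordinate of $u$ outside the parity.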



Lemma 8. Under the conditions of Lemma 7,

\[ \forall u \in \mathbb{R}^n, \quad E_{x \sim D_c}[(x \cdot u)^r] = r! \prod_{i : c_i = 1} u_i + E_{x \in \{\pm 1\}^n}[(x \cdot u)^r]. \quad (4) \]

Proof. Notice that

\[ E_{x \in \{\pm 1\}^n}[(x \cdot u)^r] = \frac{1}{2} E_{\chi_c(x) = -1}[(x \cdot u)^r] + \frac{1}{2} E_{\chi_c(x) = 1}[(x \cdot u)^r] \]

and that

\[ E_{x \in \{\pm 1\}^n}[\chi_c(x)(x \cdot u)^r] = -\frac{1}{2} E_{\chi_c(x) = -1}[(x \cdot u)^r] + \frac{1}{2} E_{\chi_c(x) = 1}[(x \cdot u)^r]; \]

therefore

\[ E_{x \sim D_c}[(x \cdot u)^r] = E_{x \in \{\pm 1\}^n}[(x \cdot u)^r] - E_{x \in \{\pm 1\}^n}[\chi_c(x)(x \cdot u)^r]. \]

Equation (4) follows now by Lemma 9 below.

□



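Lemma 9. Let $c$ be an $r$-parity on the variables indexed by the set $I = \{i_1, \ldots, i_r\}$, $c \in \{0,1\}^n$. Let $u$
be an arbitrary vector in $\mathbb{R}^n$. Then

1. $E_{x \in \{\pm 1\}^n}[\chi_c(x)(x \cdot u)^i] = 0$ for $i < r$.

2. $E_{x \in \{\pm 1\}^n}[\chi_c(x)(x \cdot u)^r] = -r! \prod_{i \in I} u_i$.

Proof. To prove Part 1, we have

\[ E[\chi_c(x)(x \cdot u)^i] = E\Big[ \chi_c(x) \sum_{t_1 + \cdots + t_n = i} \binom{i}{t_1, \ldots, t_n} \prod_{l \in [n]} (u_l x_l)^{t_l} \Big] = \sum_{t_1 + \cdots + t_n = i} \binom{i}{t_1, \ldots, t_n} E\Big[ \chi_c(x) \prod_{l \in [n]} (u_l x_l)^{t_l} \Big]. \]

Notice that if there is some variable $j \in I$ such that $t_j = 0$ then $E_x[\chi_c(x) \prod_{l \in [n]} (u_l x_l)^{t_l}] = 0$, as the
term corresponding to $x$ always cancels out with the term corresponding to the element obtained
by flipping the $j$th bit of $x$. Since $i < r$, every term $\prod_{l \in [n]} (u_l x_l)^{t_l}$ must contain some $t_j = 0$ with
$j \in I$, which concludes that $E[\chi_c(x)(x \cdot u)^i] = 0$.

To prove Part 2 of the lemma, we will induct on $n$. For $n = r$,

\[ E[\chi_c(x)(x \cdot u)^r] = E\Big[ \chi_c(x) \sum_{t_1 + \cdots + t_r = r} \binom{r}{t_1, \ldots, t_r} \prod_{l \in [r]} (u_l x_l)^{t_l} \Big] = \sum_{t_1 + \cdots + t_r = r} \binom{r}{t_1, \ldots, t_r} E\Big[ \chi_c(x) \prod_{l \in [r]} (u_l x_l)^{t_l} \Big]. \]

As before, if some $t_j = 0$ and $j \in I = [r]$ then $E[\chi_c(x) \prod_{l \in [r]} (u_l x_l)^{t_l}] = 0$, since for each $x$ and $\tilde{x}$
obtained by flipping the $j$th bit of $x$ it is the case that $\chi_c(x) = -\chi_c(\tilde{x})$. Therefore

\[ E[\chi_c(x)(x \cdot u)^r] = \binom{r}{1, 1, \ldots, 1} E\Big[ \chi_c(x) \prod_{l \in [r]} u_l x_l \Big] = r! \Big( \prod_{l \in [r]} u_l \Big) E\Big[ \chi_c(x) \prod_{l \in [r]} x_l \Big] = -r! \prod_{l \in [r]} u_l. \]

Assume now the identity holds for $n$. Let $c \in \{0,1\}^{n+1}$ and let $j \notin I$, and for $x \in \{0,1\}^{n+1}$
define $x_{-j} \in \{0,1\}^n$ to be $x$ with the $j$th bit punctured.
Then

\[ E[\chi_c(x)(x \cdot u)^r] = E[\chi_c(x)(x_{-j} \cdot u_{-j} + x_j u_j)^r] \]
\[ = E\Big[ \chi_c(x) \sum_{0 \le i \le r} \binom{r}{i} (x_{-j} \cdot u_{-j})^{r-i} (x_j u_j)^i \Big] \]
\[ = E[\chi_c(x)(x_{-j} \cdot u_{-j})^r] + \sum_{1 \le i \le r} \binom{r}{i} E[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i} (x_j u_j)^i]. \quad (5) \]

If $i$ is even then

\[ E[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i} (x_j u_j)^i] = (u_j)^i E[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i}] = 0 \]

by Part 1 of the lemma. If $i$ is odd then

\[ E[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i} (x_j u_j)^i] = u_j^i E[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i} x_j] \]
\[ = u_j^i \Big( \frac{1}{2} E[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i} \mid x_j = 1] - \frac{1}{2} E[\chi_c(x)(x_{-j} \cdot u_{-j})^{r-i} \mid x_j = -1] \Big) = 0, \]

since $j \notin I$ and so $\chi_c(x) = \chi_c(\tilde{x})$, where $\tilde{x}$ is obtained from $x$ by flipping the $j$th bit. We can now
conclude that Equation (5) $= -r! \prod_{i \in I} u_i$. □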

Corollary 9. Let $r \in \mathbb{Z}^+$ be odd$^3$ and let $c \in \{0,1\}^n$. Let $D_c$ be the distribution uniform over
$x \in \{-1,1\}^n$ for which $\chi_c(x) = -1$. Then, $E_{x \sim D_c}[(x \cdot u)^r]$ is maximized when $u = r^{-1/2} c$.

Proof. From Lemma 7, clearly whenever $c_i = 0$, we have $u_i = 0$. It follows from the AM-GM
inequality that the product is maximized when the remaining coordinates are equal. □

Now we are ready to show the statistical dimension of moment maximization, from which 
Theorem [5] follows. 

Theorem 14. For the $r$th moment maximization problem let $\mathcal{F} = \{(u \cdot x)^r\}_{u \in \mathbb{R}^n}$ and let $\mathcal{D}$ be
a set of distributions over $\{-1,1\}^n$. Then for an odd $r$ and $\delta > 0$, let $Z$ denote the problem of
$\big( \frac{r!}{2(r+1)^{r/2}} - \delta \big)$-optimizing over $\mathcal{F}$ and $\mathcal{D}$. Then $SD(Z, 0, 1) \ge \binom{n}{r} - 1$.

Proof. Let $D_1, \ldots, D_d$ be distributions where $D_i$ is uniform over all examples $x$ in $\{0,1\}^n$
such that $\chi_{c_i}(x) = 1$; this again allows us to consider $\beta = 1$ and $\gamma = 0$.
Corollary 9 shows that under the distribution $D_i$, the moment function

\[ \max_{u \in \mathbb{R}^n : \|u\| = 1} E_{x \sim D_i}[(u \cdot x)^r] \]

is maximized at $u = r^{-1/2} c$. So, to maximize the moment, one equivalently needs to find the correct
target parity.

3 This statement does not hold for r even.

To compute the needed $\epsilon$, for $r$ odd, Lemma 7 tells us that the expected moment is simply
$r! \prod_{i : c_i = 1} u_i$, and for unit vectors, is maximized when $\forall i : c_i = 1$, $u_i = r^{-1/2}$ (and $u_i = 0$ for the
other coordinates). This yields a maximum moment of $(r!) r^{-r/2}$ for any $D_i$.

In comparison, if the measured moment is equal to $(r!)(r+1)^{-r/2}$, a simple consequence of
Lemma 7 is that to minimize $\sum_{i : c_i = 1} u_i$, then for all $i$ s.t. $c_i = 1$, we have $u_i = (r+1)^{-1/2}$. Hence,
for all $i$ s.t. $c_i = 0$, $u_i^2$ cannot exceed $1 - r((r+1)^{-1/2})^2 = (r+1)^{-1}$, so $u_i$ cannot take value greater than $(r+1)^{-1/2}$, implying a
moment of at most $(r!)(r+1)^{-r/2}$ on $D_{c'}$. This gives a bound of

\[ \epsilon \ge (r!) r^{-r/2} - (r!)(r+1)^{-r/2} \ge \frac{r!}{2(r+1)^{r/2}}. \]

The $\binom{n}{r}$ parities generating the different distributions give the statistical dimension. □

Corollary 10. For $r$ odd, any statistical algorithm for the moment maximization problem asymptotically
requires $\binom{n}{r}^{1/3}$ queries of tolerance $\binom{n}{r}^{-1/3}$ to approximate the $r$-th moment to less than
an additive term of $\frac{r!}{2(r+1)^{r/2}}$.



6 Relationship to Statistical Queries in learning 



We will now use Corollary 1 to demonstrate that our work generalizes the notion of statistical
query dimension and the statistical query lower bounds from learning theory. In an instance of a
PAC learning problem, the learner has access to random examples of an unknown boolean function
$f' : X' \to \{-1, 1\}$ from a set of boolean functions $C$ (whenever necessary, we use $'$ to distinguish
variables from the identically named ones in the context of general search problems). A random
example is a pair including a point and its label $(x', c(x'))$ such that $x'$ is drawn randomly from an
unknown distribution $D'$. For $\epsilon > 0$, the goal of an $\epsilon$-accurate learning algorithm is to find, with
high probability, a boolean hypothesis $h'$ for which $\Pr_{x' \sim D'}[h'(x') \ne f'(x')] \le \epsilon$.

A statistical query (SQ) learning algorithm [Kearns, 1998] has access to a statistical query oracle
for the unknown function $f'$ and distribution $D'$ in place of random examples. A query to the SQ
oracle is a function $\phi : X' \times \{-1, 1\} \to [-1, 1]$ that depends on both the example $x'$ and its label
$\ell$. To such a query the oracle returns a value $v$ which is within $\tau$ of $E_{D'}[\phi(x', c(x'))]$, where $\tau$ is the
tolerance parameter. A SQ algorithm does not depend on the randomness of examples and hence
must always succeed.
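
As an illustration of this oracle model, the following sketch implements one valid SQ oracle over a finite domain. Everything here is an assumption for the sake of the example: the name `sq_oracle`, the dictionary `domain_probs` mapping points to their probability under $D'$, and the choice of a uniformly random perturbation (any response within $\tau$ of the true expectation, including an adversarial one, is allowed by the model).

```python
import random

def sq_oracle(domain_probs, target, query, tau, rng):
    """Return some v with |v - E_{x' ~ D'}[query(x', target(x'))]| <= tau.

    domain_probs: dict mapping each point x' to its probability under D'.
    target:       the unknown function c, mapping x' to a label in {-1, +1}.
    query:        the function phi(x', label) with values in [-1, 1].
    """
    exact = sum(p * query(x, target(x)) for x, p in domain_probs.items())
    return exact + rng.uniform(-tau, tau)   # any perturbation within tau is a valid answer
```

An SQ algorithm must succeed for every such perturbation, which is why its guarantee does not depend on the randomness of examples.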



[Blum et al., 1994] defined the statistical query dimension or SQ-DIM of a set of functions $C$
and distribution $D'$ over $X'$ as follows (we present a simplification and strengthening due to [Yang,
2005]).

Definition 8 ([Blum et al., 1994]). For a concept class $C$ and distribution $D'$, SQ-DIM$(C, D') = d'$
if $d'$ is the largest value for which there exist $d'$ functions $c_1, c_2, \ldots, c_{d'} \in C$ such that for every
$i \ne j$, $|\langle c_i, c_j \rangle_{D'}| \le 1/d'$.

[Blum et al., 1994] proved that if a class of functions is learnable using only a polynomial number
of statistical queries of inverse polynomial tolerance then its statistical query dimension is polynomial.
[Yang, 2005] strengthened their result and proved the following bound (see [Szörényi, 2009]
for a simpler proof).

Theorem 15 ([Yang, 2005]). Let $C$ be a class of functions and $D'$ be a distribution over $X'$ and
let $d' = $ SQ-DIM$(C, D')$. Then any SQ learning algorithm for $C$ over $D'$ that makes $q$ queries
of tolerance $1/d'^{1/3}$ and outputs an $\epsilon$-accurate hypothesis for $\epsilon \le 1/2 - 1/(2d'^{1/3})$ satisfies that
$q \ge d'^{1/3}/2 - 1$.

In this result the distribution $D'$ is fixed and known to the learner (such learning is referred
to as distribution-specific) and it can be used to lower bound the complexity of learning $C$ even in
a weak sense. Specifically, it applies when the learning algorithm is only required to output a hypothesis $h'$
such that $\Pr_{x' \sim D'}[h'(x') \ne c(x')] \le 1/2 - \gamma'$ for some inverse polynomial $\gamma'$ (that is, $\epsilon \le 1/2 - \gamma'$).

We now claim that we can cast this learning problem as an optimization problem, and by doing
so, we will obtain that our statistical dimension implies a lower bound on learning which is stronger
than that of [Yang, 2005].



Let $\mathcal{L} = (C, D', \epsilon)$ be an instance of a distribution-specific learning problem of a class of functions
$C$ over distribution $D'$ to accuracy $1 - \epsilon$. We define the following $2\epsilon$-optimization problem $\mathcal{Z}_{\mathcal{L}}$
over distributions. The domain is the set of all labeled points, $X = X' \times \{-1, 1\}$. When the target
function equals $c \in C$, the learning algorithm gets samples from the distribution $D_c$ over $X$, where
$D_c(x', c(x')) = D'(x')$ and $D_c(x', -c(x')) = 0$. Therefore we define the set of distributions over
which we optimize to be $\mathcal{D}_{\mathcal{L}} = \{D_c \mid c \in C\}$. Note that the STAT oracle for $D_c$ with tolerance $\tau$ is
equivalent to the statistical query oracle for $c$ over $D'$ with tolerance $\tau$. We can take the class
of functions $\mathcal{F}_{\mathcal{L}}$ over which a learning algorithm optimizes to be the set of all boolean functions
over $X$ of the form $f(x', \ell) = f'(x') \cdot \ell$ for some boolean function $f'$ over $X'$ (an efficient learning
algorithm can only output circuits of polynomial size but this distinction is not important for our
information-theoretic bounds). We define $\mathcal{Z}_{\mathcal{L}}$ to be the problem of $2\epsilon$-optimizing over $\mathcal{F}_{\mathcal{L}}$ and $\mathcal{D}_{\mathcal{L}}$.
Note that for $f \in \mathcal{F}_{\mathcal{L}}$ and $D_c \in \mathcal{D}_{\mathcal{L}}$,

$$\mathbf{E}_{D_c}[f(x)] = \mathbf{E}_{D'}[f'(x') \cdot c(x')] = 1 - 2\Pr_{D'}[f'(x') \ne c(x')]$$

and therefore learning to accuracy $1 - \epsilon$ is equivalent to $2\epsilon$-optimizing over $\mathcal{F}_{\mathcal{L}}$ and $\mathcal{D}_{\mathcal{L}}$.
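
The identity above can be checked numerically on a toy instance. The following is a minimal sketch, not part of the original development: the domain $\{0,1\}^3$, the uniform choice of $D'$, and the particular target and hypothesis are assumptions made only for illustration, and the helper names are hypothetical.

```python
from itertools import product

def expectation_under_Dc(f_prime, c, points, d_prime):
    # D_c puts mass D'(x') on (x', c(x')) and zero on (x', -c(x')), so
    # E_{D_c}[f'(x') * label] = sum over x' of D'(x') * f'(x') * c(x').
    return sum(d_prime[x] * f_prime(x) * c(x) for x in points)

def disagreement(f_prime, c, points, d_prime):
    # Pr_{D'}[f'(x') != c(x')]
    return sum(d_prime[x] for x in points if f_prime(x) != c(x))

# Toy instance (assumed): X' = {0,1}^3 with uniform D'.
points = list(product([0, 1], repeat=3))
d_prime = {x: 1.0 / len(points) for x in points}

def c(x):
    # Target: +-1 parity of the first two bits.
    return 1 if (x[0] ^ x[1]) == 0 else -1

def f_prime(x):
    # Hypothesis: depends on the first bit only.
    return 1 if x[0] == 0 else -1

lhs = expectation_under_Dc(f_prime, c, points, d_prime)
rhs = 1 - 2 * disagreement(f_prime, c, points, d_prime)

# A perfect hypothesis attains the maximum value 1 of the objective.
best = expectation_under_Dc(c, c, points, d_prime)
```

Here `lhs` and `rhs` agree, and `best` equals 1, so a hypothesis with error at most $\epsilon$ is within $2\epsilon$ of the optimum, as the reduction requires.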

We claim that the SQ-DIM$(C, D')$-based lower bound given in Theorem 15 is effectively just a minor
learning-specific simplification of our statistical dimension lower bound for $\mathcal{Z}_{\mathcal{L}}$ (Cor. 1).

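Theorem 16. Let $C$ be a class of functions and $D'$ be a distribution over $X'$ and let $d' = $
SQ-DIM$(C, D')$. Denote by $\mathcal{L} = (C, D', \epsilon)$ the instance of learning $C$ over $D'$ for $\epsilon = 1/2 - 1/(2d'^{1/3})$.
Then

$$\mathrm{SD}(\mathcal{Z}_{\mathcal{L}}, \gamma = 1/d', \beta = 1) \ge d' - \frac{1 - 1/d'}{1/d'^{2/3} - 1/d'}.$$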



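Proof. Let $c_1, c_2, \ldots, c_{d'}$ be the almost uncorrelated functions in $C$ implied by the definition of
SQ-DIM$(C, D')$. We define the reference distribution $D$ as the distribution for which for every
$(x', \ell) \in X$, $D(x', \ell) = D'(x')/2$. We note that this ensures that $D(x', \ell)$ is non-vanishing exactly
when $D'(x')$ is non-vanishing and hence the function $\left(\frac{D_c}{D} - 1\right)$ will be well-defined for all $c \in C$.
For every $c \in C$, we have

$$\frac{D_c(x', c(x'))}{D(x', c(x'))} - 1 = 1 \quad \text{and} \quad \frac{D_c(x', -c(x'))}{D(x', -c(x'))} - 1 = -1.$$

Therefore, $\frac{D_c(x', \ell)}{D(x', \ell)} - 1 = \ell \cdot c(x')$. This implies that for any $c_i, c_j \in C$,

$$\left\langle \frac{D_{c_i}}{D} - 1, \frac{D_{c_j}}{D} - 1 \right\rangle_D = \mathbf{E}_D[\ell \cdot c_i(x') \cdot \ell \cdot c_j(x')] = \mathbf{E}_{D'}[c_i(x') \cdot c_j(x')] = \langle c_i, c_j \rangle_{D'}.$$

Hence

1. for any $c \in C$, $\left\| \frac{D_c}{D} - 1 \right\|_D^2 = 1$;

2. for any $i \ne j \le d'$, $\left\langle \frac{D_{c_i}}{D} - 1, \frac{D_{c_j}}{D} - 1 \right\rangle_D \le 1/d'$.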

These properties imply that the $d'$ functions in $C$ give $d'$ distributions in $\mathcal{D}_{\mathcal{L}}$ whose distinguishing
functions are almost uncorrelated. This is essentially the condition required to obtain a lower
bound of $d'$ on $\mathrm{SD}(\mathcal{Z}_{\mathcal{L}}, 1/d', 1)$. The only issue is that we need to exclude distributions for which
any given $f \in \mathcal{F}_{\mathcal{L}}$ is $2\epsilon$-optimal. We claim that it is easy to bound the number of distributions
which are $2\epsilon$-optimal for a fixed $f(x', \ell) = f'(x') \cdot \ell$ and whose distinguishing functions are almost
uncorrelated. First, note that the condition of $2\epsilon$-optimality of $f$ for $D_c$ states that

$$\mathbf{E}_{D_c}[f(x)] \ge 1 - 2\epsilon \ge 1/d'^{1/3}.$$

On the other hand, $\mathbf{E}_D[f(x)] = 0$ and therefore $\mathbf{E}_{D_c}[f(x)] - \mathbf{E}_D[f(x)] \ge 1/d'^{1/3}$. This implies that
if we view $f(x)$ as a query function then the expectations of the query function relative to $D_c$ and $D$
differ by at least $\tau = 1/d'^{1/3}$. In the proof of Corollary 1 we proved that this is possible for at most
$(\beta - \gamma)/(\tau^2 - \gamma)$ distributions with pairwise correlations $(\gamma, \beta)$. For our parameters this gives a
bound of

$$\frac{1 - 1/d'}{1/d'^{2/3} - 1/d'}$$

distributions. Hence for $m = d' - \frac{1 - 1/d'}{1/d'^{2/3} - 1/d'}$ we obtain that for every $f \in \mathcal{F}_{\mathcal{L}}$
there exist $m$ distributions $D_1, \ldots, D_m \in \{D_{c_1}, \ldots, D_{c_{d'}}\} \setminus \mathcal{Z}_{\mathcal{L}}^{-1}(f)$ such that

1. for any $i \le m$, $\left\| \frac{D_i}{D} - 1 \right\|_D^2 = 1$;

2. for any $i \ne j \le m$, $\left\langle \frac{D_i}{D} - 1, \frac{D_j}{D} - 1 \right\rangle_D \le 1/d'$.



□ 
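
The two identities used in the proof, and the counting step that excludes distributions for which a fixed $f$ is nearly optimal, can be illustrated on a small example. The sketch below is an illustration under stated assumptions rather than part of the proof: it takes $X' = \{0,1\}^3$ with uniform $D'$, uses the seven non-constant parities as the concept class (so $\gamma = 0$ and $\beta = 1$), and picks $f'$ to be majority with $\tau = 1/2$. All function names are hypothetical.

```python
from itertools import product

# Toy instance (assumed): X' = {0,1}^3, D' uniform, C = non-constant parities.
points = list(product([0, 1], repeat=3))
d_prime = {x: 1.0 / len(points) for x in points}

def parity(subset):
    """Return chi_S(x) = (-1)^(sum of x_i for i in S) as a +-1 valued function."""
    return lambda x: -1 if sum(x[i] for i in subset) % 2 else 1

subsets = [tuple(i for i in range(3) if mask[i])
           for mask in product([0, 1], repeat=3) if any(mask)]
concepts = [parity(s) for s in subsets]

# Labeled domain X = X' x {-1, 1}; reference distribution D(x', l) = D'(x') / 2.
domain = [(x, l) for x in points for l in (-1, 1)]
D = {(x, l): d_prime[x] / 2 for (x, l) in domain}

def distinguishing(c):
    """The function D_c / D - 1 on X, where D_c puts mass D'(x') on (x', c(x'))."""
    return {(x, l): (d_prime[x] if l == c(x) else 0.0) / D[(x, l)] - 1
            for (x, l) in domain}

def inner_D(g, h):
    return sum(D[z] * g[z] * h[z] for z in domain)

def inner_D_prime(ci, cj):
    return sum(d_prime[x] * ci(x) * cj(x) for x in points)

# Identity 1: (D_c / D - 1)(x', l) = l * c(x').
identity_holds = all(distinguishing(c)[(x, l)] == l * c(x)
                     for c in concepts for (x, l) in domain)

# Identity 2: <D_ci/D - 1, D_cj/D - 1>_D = <ci, cj>_D'.
gram_matches = all(
    abs(inner_D(distinguishing(ci), distinguishing(cj)) - inner_D_prime(ci, cj)) < 1e-12
    for ci in concepts for cj in concepts)

# Counting step: fix f(x', l) = f'(x') * l with f' = majority of the +-1 bits.
def majority(x):
    return 1 if sum(x) <= 1 else -1

tau, gamma, beta = 0.5, 0.0, 1.0  # parities are exactly uncorrelated here
E_D_f = sum(D[(x, l)] * majority(x) * l for (x, l) in domain)
# E_{D_c}[f] equals E_{D'}[f' * c], so the advantage over D is that correlation.
advantaged = [s for s, c in zip(subsets, concepts)
              if inner_D_prime(majority, c) - E_D_f >= tau]
bound = (beta - gamma) / (tau ** 2 - gamma)
```

In this toy case only the three single-coordinate parities reach advantage $\tau$ against majority, which is within the bound $(\beta - \gamma)/(\tau^2 - \gamma) = 4$.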



Applying Corollary 1 we get the following lower bound, which is twice as large as the $d'^{1/3}/2 - 1$
bound of Yang [2005].



Corollary 11. Let $C$ be a class of functions and $D'$ be a distribution over $X'$, let $d' = $ SQ-DIM$(C, D')$
and let $\epsilon = 1/2 - 1/(2d'^{1/3})$. Then any SQ learning algorithm requires at least $d'^{1/3} - 2$ queries of
tolerance $1/d'^{1/3}$ to $\epsilon$-accurately learn $C$ over $D'$.
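
The arithmetic behind the $d'^{1/3} - 2$ figure can be sketched as follows. This is a reconstruction under assumptions, not the authors' derivation: it assumes that the statistical dimension bound being applied guarantees at least $d(\tau^2 - \gamma)/(\beta - \gamma)$ queries of tolerance $\tau$ when the statistical dimension is $d$, and that Theorem 16 yields $d \ge d' - R$, where the symbol $R$ is introduced here only as shorthand for the number of excluded distributions.

```latex
\begin{align*}
R &= \frac{\beta - \gamma}{\tau^2 - \gamma}
   = \frac{1 - 1/d'}{1/d'^{2/3} - 1/d'}
   \qquad (\gamma = 1/d',\ \beta = 1,\ \tau = 1/d'^{1/3}), \\
q &\ge \frac{d' - R}{R}
   = \frac{d'}{R} - 1
   = \frac{d'^{1/3} - 1}{1 - 1/d'} - 1
   \ge d'^{1/3} - 2
   \qquad (d' > 1).
\end{align*}
```

The last step uses $0 < 1 - 1/d' \le 1$. Since $d'^{1/3} - 2 = 2\,(d'^{1/3}/2 - 1)$, the resulting bound is exactly twice the one in Theorem 15.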



6.1 Honest Statistical Queries 



We now turn to the Honest SQ model [Jackson, 2003, Yang, 2001], which inspired our notion of
statistical sampling algorithms. In the Honest SQ model, the learner has access to an HSQ oracle
and can again evaluate queries which are a function of the data points and their labels. As in our
HSTAT oracle, the queries are evaluated on an "honest" sample drawn from the target distribution.
More precisely, the HSQ oracle accepts a function $\phi : X' \times \{-1, 1\} \to \{-1, 1\}$ and a sample size
$t > 0$, draws $x'_1, \ldots, x'_t \sim D'$, and returns the value $\frac{1}{t} \sum_{i=1}^{t} \phi(x'_i, c(x'_i))$. The total sample count of
an algorithm is the sum of the sample sizes it passes to HSQ.

We note that using our one-sample-per-query-function oracle HSTAT one can simulate estima- 
tion of queries from larger samples in the straightforward way while obtaining the same sample 
complexity. Therefore HSQ is equivalent to our HSTAT oracle. 
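
The simulation just described can be sketched in a few lines. This is a minimal illustration, assuming an HSTAT-style oracle that draws one fresh labeled example per query and returns the value of the query function on it; the class name `HStat`, the function `hsq`, and the toy distribution and target are hypothetical and are not taken from the paper.

```python
import random

class HStat:
    """One-sample-per-query oracle (assumed interface): each call draws a
    fresh x' ~ D' and returns phi(x', c(x'))."""

    def __init__(self, sample_point, target, seed=0):
        self.rng = random.Random(seed)
        self.sample_point = sample_point  # draws x' ~ D' given an RNG
        self.target = target              # the unknown concept c
        self.samples_used = 0

    def query(self, phi):
        x = self.sample_point(self.rng)
        self.samples_used += 1
        return phi(x, self.target(x))

def hsq(oracle, phi, t):
    """Simulate an HSQ query of sample size t by averaging t one-sample
    queries with the same function; this uses exactly t samples."""
    if t <= 0:
        raise ValueError("sample size t must be positive")
    return sum(oracle.query(phi) for _ in range(t)) / t

# Toy instance (assumed): D' uniform over {0,1}^4, target = parity of bits 0, 1.
def sample_point(rng):
    return tuple(rng.randint(0, 1) for _ in range(4))

def target(x):
    return 1 if (x[0] ^ x[1]) == 0 else -1

oracle = HStat(sample_point, target, seed=1)
# This query multiplies the label by the target itself, so every sample gives 1.
agreement = hsq(oracle, lambda x, label: label * target(x), 50)
# A query that returns the label alone estimates E[c(x')], a value in [-1, 1].
label_mean = hsq(oracle, lambda x, label: label, 50)
```

After the two calls the oracle has consumed 100 samples in total, matching the sum of the sample sizes that an HSQ algorithm would be charged.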

We first observe that our direct simulation in Theorem 9 implies that the Honest SQ learning
model is equivalent (up to polynomial factors) to the SQ learning model. We are not aware of
this observation having been made before (although Valiant [2009] implicitly uses it to show that
evolvable concept classes are also learnable in the SQ model).

We now show that using Corollary 4 we can derive sample complexity bounds on honest statistical
query algorithms for learning.

Corollary 12. Let $C$ be a class of functions, $D'$ be a distribution over $X'$, $d' = $ SQ-DIM$(C, D')$
and $\epsilon = 1/2 - 1/(2d'^{1/3})$. Then the sample complexity of any Honest SQ algorithm for $\epsilon$-accurate
learning of $C$ over $D'$ is $\Omega(\sqrt{d'})$.



This recovers the bound in [Yang, 2005] up to polynomial factors